In this paper, we present a novel scheme for object-based video analysis and interpretation based on automatic video object extraction, video object abstraction, and semantic event modeling. In this scheme, video objects (VOs) are first extracted automatically; a video object abstraction algorithm then identifies key frames to reduce data redundancy and to provide reliable feature data for the next stage of the algorithm. After the semantic objects, that is, the VOs that dominate the semantics of a video shot, are identified (either automatically or by user selection), all other objects in the shot are treated as background. The semantic feature modeling scheme is based on the temporal variation of low-level features within the semantic object area. More specifically, a general Dynamic Bayesian Network (DBN) is used to characterize the spatial-temporal nature of the semantic objects. Experimental results demonstrating the effectiveness of the proposed approach are also presented.