This paper presents a method that integrates audio and visual information for human action scene analysis. The approach is top-down, determining and extracting action scenes in video by analyzing both audio and video data. We propose a framework for recognizing actions by measuring image- and action-based information from video, with the following characteristics: feature extraction is performed automatically; the method handles both visual and auditory information and captures both spatial and temporal characteristics; and the extracted features are natural, in the sense that they are closely related to human perceptual processing. Our effort is to implement action identification by extracting syntactic properties of a video, such as edge features, colour distribution, audio, and motion vectors. In this paper, we present a simple method for human activity recognition based on Hidden Markov models (HMMs) for sensing, learning, and training the actions. In addition, we use audio-visual features to distinguish human actions and to reach a decision. We describe the use of the model to diagnose states of a human activity based on events from video. We review the model, present an implementation, and report on experiments that demonstrate the robustness of the framework.
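As an illustrative sketch only (not the paper's actual implementation), the HMM-based decision step could be realized by training one discrete HMM per action class over quantized audio-visual feature symbols and selecting the class whose model best explains an observed sequence. All model parameters, action names, and function names below are hypothetical:

```python
import numpy as np

def forward_log_likelihood(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    computed with the scaled forward algorithm to avoid underflow.
    obs: symbol indices; pi: initial state distribution;
    A: state-transition matrix; B: per-state emission probabilities."""
    alpha = pi * B[:, obs[0]]          # forward variable at t = 0
    log_lik = 0.0
    for t in range(1, len(obs)):
        c = alpha.sum()                # scaling factor for this step
        log_lik += np.log(c)
        alpha = (alpha / c) @ A * B[:, obs[t]]
    return log_lik + np.log(alpha.sum())

# Hypothetical per-action models over 2 quantized feature symbols.
models = {
    "walk": (np.array([0.6, 0.4]),
             np.array([[0.7, 0.3], [0.4, 0.6]]),
             np.array([[0.9, 0.1], [0.2, 0.8]])),
    "jump": (np.array([0.5, 0.5]),
             np.array([[0.5, 0.5], [0.5, 0.5]]),
             np.array([[0.3, 0.7], [0.6, 0.4]])),
}

def classify(obs):
    """Assign the action whose HMM gives the highest sequence likelihood."""
    return max(models, key=lambda a: forward_log_likelihood(obs, *models[a]))
```

In a full system, the symbol sequence would come from quantizing the extracted edge, colour, audio, and motion-vector features, and the per-class parameters would be learned (e.g., by Baum-Welch) rather than fixed by hand as here.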