This paper presents an improved hierarchical bag-of-words model based on local space-time features. The model generates multi-level features, in which each higher-level feature is built from a neighborhood of lower-level features. An improved method is developed to extract the low-level local space-time features: the concept of video orthogonal planes is introduced, and interest points are detected on these planes. A weighting function integrates the descriptors of the cuboids extracted around the interest points to form the lowest-level local features. The multi-level features generated by the hierarchical bag-of-words model are then combined to represent the actions in a video for action recognition. Experiments carried out on the KTH and Weizmann datasets demonstrate that our method yields a higher recognition rate.
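To make the notion of video orthogonal planes concrete, the following minimal sketch slices a grayscale video volume into its three orthogonal planes through a given point and combines per-plane descriptors with a weighted average. It is an illustration under stated assumptions, not the method evaluated in this paper: the function names `orthogonal_planes` and `weighted_descriptor`, the `(T, H, W)` array layout, and the use of a normalized weighted average as the weighting function are all hypothetical choices made for the example.

```python
import numpy as np


def orthogonal_planes(video, t, y, x):
    """Return the XY, XT and YT planes of a (T, H, W) volume through (t, y, x).

    Assumption: the video is a grayscale NumPy array indexed as
    (frame, row, column). This layout is chosen for illustration only.
    """
    xy = video[t, :, :]  # spatial frame at time t, shape (H, W)
    xt = video[:, y, :]  # row y followed over time, shape (T, W)
    yt = video[:, :, x]  # column x followed over time, shape (T, H)
    return xy, xt, yt


def weighted_descriptor(descriptors, weights):
    """Combine equal-length descriptors with a normalized weighted average.

    Assumption: a weighted average stands in for the weighting function,
    whose exact form is not specified here.
    """
    d = np.asarray(descriptors, dtype=float)  # shape (n_descriptors, dim)
    w = np.asarray(weights, dtype=float)      # shape (n_descriptors,)
    w = w / w.sum()                           # normalize weights to sum to 1
    return (w[:, None] * d).sum(axis=0)       # shape (dim,)


if __name__ == "__main__":
    # Toy volume: 2 frames of 3 x 4 pixels.
    video = np.arange(2 * 3 * 4).reshape(2, 3, 4)
    xy, xt, yt = orthogonal_planes(video, t=1, y=2, x=3)
    print(xy.shape, xt.shape, yt.shape)
    print(weighted_descriptor([[1.0, 0.0], [0.0, 1.0]], [1.0, 3.0]))
```

In such a sketch, an interest-point detector would be run on each of the three planes, and the descriptors computed in the surrounding cuboid would then be passed to `weighted_descriptor`; the choice of weights is left open.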