We propose a framework for human action recognition that learns a pose dictionary as the human appearance representation. First, a shape-based pose feature is constructed from the contour points of the human silhouette; it is invariant to translation and scaling. After the local pose features are extracted from the original videos, a class-specific dictionary is learned on the training frames of each class. The full pose dictionary is then built by concatenating all class-specific dictionaries, and a sparse representation over the learned dictionary is estimated for every frame of a test video. Finally, the test video is assigned to the class that yields the minimum total reconstruction error over all of its frames. Experimental results on the Weizmann and MuHAVi-MAS14 datasets demonstrate the effectiveness of our method.
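The classification step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionaries here are random stand-ins for the learned class-specific dictionaries, and a least-squares fit per sub-dictionary is used as a simplified stand-in for the sparse coding step. The names `frame_error` and `classify_video` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_atoms, dim = 3, 20, 50

# Stand-ins for the class-specific dictionaries (in the paper these are
# learned from the training frames of each class).
dicts = [rng.standard_normal((dim, n_atoms)) for _ in range(n_classes)]

def frame_error(y, D):
    """Reconstruction error of frame feature y under sub-dictionary D
    (least-squares used here in place of the sparse coding step)."""
    x, *_ = np.linalg.lstsq(D, y, rcond=None)
    return np.linalg.norm(y - D @ x)

def classify_video(frames, dicts):
    """Assign the video to the class whose sub-dictionary yields the
    minimum total reconstruction error over all frames."""
    errors = [sum(frame_error(y, D) for y in frames) for D in dicts]
    return int(np.argmin(errors))

# Usage: frames drawn from class 1's subspace are assigned to class 1,
# since their reconstruction error under dicts[1] is near zero.
frames = [dicts[1] @ rng.standard_normal(n_atoms) for _ in range(5)]
pred = classify_video(frames, dicts)
```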