RGB-D action streams have attracted considerable attention for action recognition because they capture geometric information and are less sensitive to illumination. However, actions of the same class can diverge widely across sub-actions, subjects, and modalities, which can degrade recognition accuracy. To address these three sources of intra-class variation, we propose a Sparse alignment guided Non-negative Tensor Factorization (SaNTF) framework, which uses non-negative tensor factorization (NTF) for subspace learning. SaNTF selects key frames from two intra-class action sequences by sparse regression and aligns them to reduce their divergence in the new tensor subspace. Each high-dimensional RGB-D action sequence is represented as a third-order tensor to preserve its original spatiotemporal structure, and NTF is employed to find a common tensor subspace for realistic action recognition. Experiments on the MSRDailyActivity3D and MSRPair3D action datasets show higher accuracy than state-of-the-art temporal alignment methods.
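
To make the NTF building block concrete, the following is a minimal sketch of non-negative factorization of a third-order tensor. It illustrates the general technique only and is not the SaNTF algorithm: it assumes a CP-style (rank-R sum of outer products) model fitted with standard multiplicative updates, omits the sparse-regression key-frame selection and alignment steps entirely, and uses hypothetical names (`ntf_cp`, `khatri_rao`, `unfold`, `reconstruct`). The height x width x frames layout of the toy tensor is likewise an assumption.

```python
import numpy as np


def khatri_rao(a, b):
    """Column-wise Kronecker product of a (I x R) and b (J x R), giving (I*J x R)."""
    rank = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, rank)


def unfold(x, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the other two axes."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)


def reconstruct(factors):
    """Rebuild the third-order tensor from its three factor matrices."""
    a, b, c = factors
    return np.einsum("ir,jr,kr->ijk", a, b, c)


def ntf_cp(x, rank, n_iter=200, eps=1e-9, seed=0):
    """Non-negative CP factorization of a non-negative third-order tensor.

    Assumed model (illustrative, not taken from the paper):
        x[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r],  with A, B, C >= 0.
    Each factor is refined by a multiplicative update, which keeps entries
    non-negative as long as the input and the initialization are non-negative.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in x.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(x, mode) @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps  # eps avoids division by zero
            factors[mode] *= numer / denom
    return factors


if __name__ == "__main__":
    # Toy non-negative "sequence": height x width x frames (assumed layout).
    rng = np.random.default_rng(1)
    truth = [rng.random((d, 3)) for d in (8, 6, 10)]
    x = reconstruct(truth)

    factors = ntf_cp(x, rank=3)
    rel_err = np.linalg.norm(x - reconstruct(factors)) / np.linalg.norm(x)
    print("relative reconstruction error:", rel_err)
```

In a sketch like this, the factor matrix for the temporal mode would play the role of a low-dimensional, non-negative description of the frames; how SaNTF couples such a subspace with the aligned key frames is not specified here.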