In this paper, we propose a new video representation for action recognition that combines image-based deep features with an efficient pooling strategy. Convolutional Neural Network (CNN) features have recently emerged as the state of the art for image classification, and several attempts have been made to extend such CNN models to videos by explicitly modeling the temporal evolution of the frames. Feature pooling is one such approach, representing a video sequence through statistical properties of the feature dimensions over its frames. However, traditional pooling strategies such as max or average pooling fail to capture the temporal progression of the frame-level contents. In contrast, we propose a two-level video representation that separately attends to the entire video and to a number of video sub-volumes. At both levels, we efficiently apply a generic time-series pooling to the frame-level deep CNN features. Further, we employ self-tuning spectral clustering to highlight video snippets that are likely to contain significant sub-action sequences. We validate the proposed feature encoding on the challenging KTH-actions and UCF-50 datasets and find that it outperforms traditional pooling-based feature representations by a substantial margin.
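
To make the limitation of order-insensitive pooling concrete, the following sketch contrasts max and average pooling with a simple time-series statistic (the least-squares temporal slope of each feature dimension). This is only an illustration of the general idea, not the paper's exact encoding; the function names and the toy feature matrix are our own.

```python
import numpy as np

def max_pool(feats):
    # feats: (T, D) array of frame-level features; order-insensitive
    return feats.max(axis=0)

def avg_pool(feats):
    # order-insensitive: any permutation of the frames gives the same result
    return feats.mean(axis=0)

def trend_pool(feats):
    # least-squares slope over time for each feature dimension;
    # sensitive to the temporal progression of the frames
    T = feats.shape[0]
    t = np.arange(T) - (T - 1) / 2.0  # centered time index
    return (t @ feats) / (t @ t)

# Toy example: dimension 0 rises over time, dimension 1 falls.
feats = np.vstack([np.linspace(0, 1, 8), np.linspace(1, 0, 8)]).T  # (T=8, D=2)

# max/average pooling assign both dimensions identical values,
# whereas the temporal slope distinguishes the opposite progressions.
```

Reversing the frame order leaves `max_pool` and `avg_pool` unchanged but flips the sign of `trend_pool`, which is precisely the kind of temporal information a time-series pooling can retain.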