Most local sparse representation models for visual tracking contain three components: 1) extracting local descriptors from the target region, 2) encoding the extracted local descriptors as mid-level features, and 3) aggregating statistics of the mid-level features into a signature. Because the last step aggregates only first-order statistics of the mid-level features, it is called First-order Pooling (FP). However, FP lacks high-order statistical information about the target and therefore cannot reflect the correlation between features, which leads to poor tracking performance. In this paper, we introduce an appearance model for visual tracking that conducts High-order Pooling (HP) over mid-level features within the sparse coding framework. We find that, compared with a first-order signature, higher-order statistics of mid-level features combined with additional image information bring large gains in tracking performance. Moreover, a simple but effective updating scheme is adopted to improve the tracker's adaptability. Experiments on various challenging videos show that appearance models using HP achieve better tracking performance than those using FP.
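
To make the distinction between FP and HP concrete, the following minimal sketch contrasts average (first-order) pooling with second-order average pooling, taken here as the simplest instance of high-order pooling. It is an illustration under stated assumptions rather than the formulation used in this paper: the function names `first_order_pool` and `second_order_pool`, the two-dimensional toy codes, and the choice of the averaged outer product as the high-order statistic are all hypothetical, and the additional image information and updating scheme are not modeled.

```python
def first_order_pool(codes):
    """Average pooling: the mean of the mid-level feature vectors."""
    n = len(codes)
    dim = len(codes[0])
    return [sum(c[k] for c in codes) / n for k in range(dim)]


def second_order_pool(codes):
    """Second-order average pooling: the mean of the outer products c c^T.

    Entry (i, j) measures how strongly features i and j co-activate
    across the local patches, which the first-order mean cannot capture.
    """
    n = len(codes)
    dim = len(codes[0])
    return [
        [sum(c[i] * c[j] for c in codes) / n for j in range(dim)]
        for i in range(dim)
    ]


# Hypothetical toy codes for two patches each (assumed values, not real data).
set_a = [[1.0, 1.0], [0.0, 0.0]]  # both features fire on the same patch
set_b = [[1.0, 0.0], [0.0, 1.0]]  # features fire on different patches

fp_a, fp_b = first_order_pool(set_a), first_order_pool(set_b)
hp_a, hp_b = second_order_pool(set_a), second_order_pool(set_b)

print(fp_a == fp_b)  # True: the first-order signatures are identical
print(hp_a == hp_b)  # False: the off-diagonal correlation terms differ
```

In this toy case the two sets of codes have the same mean, so FP cannot tell them apart, whereas the off-diagonal entries of the second-order signature differ (0.5 versus 0.0) and expose the difference in feature correlation.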