Robust detection of people in video is critical for visual surveillance. In this work, we present a framework for robust people detection in highly cluttered scenes captured as low-resolution image sequences. Our model uses both human appearance and long-term human motion information, fusing the two within a Bayesian framework. In particular, we introduce a spatial pyramid Gaussian mixture approach to model variations in long-term human motion, which is computed via improved background modeling that uses spatial motion constraints. Human appearance, in turn, is modeled by histograms of oriented gradients. Experiments demonstrate that our method significantly reduces the false positive rate compared with a state-of-the-art human detector under very challenging lighting conditions, occlusion and background clutter.