We propose a novel model-based method for tracking the 6-DOF pose of a very large number of rigid objects in real time. By combining dense motion and depth cues with sparse keypoint correspondences, and by feeding back information from the modeled scene to the cue extraction process, the method is both highly accurate and robust to noise and occlusions. Tight integration of the graphics and compute capabilities of graphics processing units allows the method to track hundreds of objects simultaneously in real time. We achieve pose updates at frame rates of around 40 Hz when using 500 000 data samples to track 150 objects in images of resolution $640\times 480$. We introduce a synthetic benchmark data set with varying objects, background motion, noise, and occlusions that enables the evaluation of stereo-vision-based pose estimators in complex scenarios. Using this data set and a novel evaluation methodology, we show that the proposed method greatly outperforms state-of-the-art methods. Finally, we demonstrate excellent performance on challenging real-world sequences in which multiple objects are manipulated.