Bag of visual words (BoVW) remains a competitive representation for scene classification. Within this framework, extracting SIFT descriptors on a dense grid of pixels has been shown to improve performance. However, because SIFT is an edge-based descriptor, computing it on homogeneous regions can yield unstable descriptors. The common solution in the literature is to discard these descriptors from the final image-level representation. We argue that homogeneous regions contain valuable scene information if represented appropriately. Accordingly, we propose a simple yet effective method to model homogeneous image regions. We refer to these models as contextual information and investigate their importance for scene classification. The final image-level representation is obtained by stacking the feature vectors of homogeneous and non-homogeneous regions. The proposed approach is validated on two de facto standard benchmarks for scene classification: the Fifteen Scene Categories and 67 Indoor Scenes datasets. Experimental results on these two datasets demonstrate the effectiveness of the proposed approach.
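The pipeline sketched in the abstract (quantize local descriptors into a BoVW histogram, describe homogeneous regions separately, and stack both vectors) can be illustrated as follows. This is a minimal numpy sketch, not the paper's implementation: the variance-based homogeneity test, the `var_threshold` parameter, and the mean-intensity contextual descriptor are all hypothetical stand-ins, and raw patch vectors play the role of SIFT descriptors.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest codeword and
    return an L1-normalized histogram of codeword counts."""
    # Pairwise squared distances between descriptors and codewords
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

def image_representation(patches, codebook, var_threshold=1e-3):
    """Split patches into homogeneous (low variance) and textured
    groups, describe each group, and stack the two vectors.
    `patches` is an (N, p, p) array; `codebook` is (K, p*p)."""
    variances = patches.reshape(len(patches), -1).var(axis=1)
    homogeneous = patches[variances < var_threshold]
    textured = patches[variances >= var_threshold]
    # Contextual part: mean intensity of homogeneous patches
    # (an illustrative stand-in for the paper's contextual model)
    context = np.array([homogeneous.mean() if len(homogeneous) else 0.0])
    # BoVW part: quantize descriptors of the textured patches only
    if len(textured):
        desc = textured.reshape(len(textured), -1)
        bovw = bovw_histogram(desc, codebook)
    else:
        bovw = np.zeros(len(codebook))
    # Final representation: stacked BoVW and contextual vectors
    return np.concatenate([bovw, context])
```

For example, with a codebook of 8 codewords and 4x4 patches, the representation is a length-9 vector: an 8-bin BoVW histogram followed by one contextual value.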