We study scene classification for RGB-D images. We first analyze the differences between RGB and depth images and then, based on these differences, implement an efficient method that exploits both modalities and fuses their features effectively. To account for the modality gap between RGB and depth images, we propose learning features from color and depth separately using heterogeneous models. Specifically, we use a deep ConvNet with shallow fine-tuning for RGB images and a relatively shallow ConvNet with deep fine-tuning for depth images, so that the distinct characteristics of the two modalities are adequately extracted. After obtaining discriminative features for each modality, we train multiple fully connected layers followed by a softmax classifier to harness the complementary relationship between the two modalities. Experimental evaluations on two publicly available RGB-D datasets validate the effectiveness of the proposed method.
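The fusion stage described above can be illustrated with a minimal sketch. It assumes that the per-modality features have already been extracted and that they are combined by simple concatenation before the fully connected layers; the abstract does not specify the combination rule, the activation function, or any layer sizes, so the concatenation, the ReLU activations, and all dimensions and names below (`fuse_and_classify`, `RGB_DIM`, `DEPTH_DIM`, `HIDDEN`, `NUM_CLASSES`) are hypothetical choices made only for illustration. The weights are random and untrained, so the output shows only the data flow, not a meaningful prediction.

```python
import math
import random


def relu(v):
    """Element-wise ReLU (an assumed activation, not stated in the abstract)."""
    return [max(0.0, x) for x in v]


def softmax(v):
    """Numerically stable softmax over a list of scores."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Create one fully connected layer with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, v):
    """Apply a fully connected layer: weights @ v + biases."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, biases)]


def fuse_and_classify(rgb_feat, depth_feat, layers):
    """Concatenate the two feature vectors (assumed fusion rule), pass them
    through the hidden fully connected layers, and apply softmax to the
    scores produced by the last layer."""
    h = list(rgb_feat) + list(depth_feat)
    for layer in layers[:-1]:
        h = relu(dense(layer, h))
    return softmax(dense(layers[-1], h))


# Hypothetical sizes, chosen small so the example runs instantly.
RGB_DIM, DEPTH_DIM, HIDDEN, NUM_CLASSES = 8, 6, 5, 3

rng = random.Random(0)
layers = [
    init_layer(RGB_DIM + DEPTH_DIM, HIDDEN, rng),
    init_layer(HIDDEN, HIDDEN, rng),
    init_layer(HIDDEN, NUM_CLASSES, rng),
]

# Stand-ins for features produced by the RGB and depth ConvNets.
rgb_feat = [rng.random() for _ in range(RGB_DIM)]
depth_feat = [rng.random() for _ in range(DEPTH_DIM)]

probs = fuse_and_classify(rgb_feat, depth_feat, layers)
print(len(probs), round(sum(probs), 6))
```

In a trained system the two feature vectors would come from the fine-tuned RGB and depth networks, and the fusion layers would be learned jointly with the classifier rather than initialized at random.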