This paper presents a novel approach to automatic audio-visual emotion recognition. The audio and visual channels provide complementary information about human emotional states, and we employ Boltzmann Zippers for model-level fusion to learn the intrinsic correlations between the two modalities. We extract effective audio and visual feature streams at different time scales and feed them to two Boltzmann chains, respectively; the hidden units of the two chains are interconnected. Second-order methods are applied to the Boltzmann Zippers to speed up the learning and pruning processes. Experimental results on audio-visual emotion data collected in Wizard-of-Oz scenarios demonstrate that our approach is promising and outperforms single-modality HMM and conventional coupled-HMM methods.
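The coupled-chain structure described above (two modality-specific Boltzmann chains whose hidden units are interconnected) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the energy function, the weight names, the integer time-scale ratio, and the pairing of each visual hidden unit with one aligned audio hidden unit are all simplifying assumptions made here for exposition.

```python
import math

def zipper_energy(h_a, h_v, w_a, w_v, w_z):
    """Joint energy of two coupled binary chains (lower energy = more probable).

    h_a, h_v : binary hidden states of the audio and visual chains; the audio
               chain is longer, standing in for its finer time scale.
    w_a, w_v : intra-chain link weights (one per adjacent pair of units).
    w_z      : cross-modal "zipper" weights (one per visual unit).
    """
    e = 0.0
    for t in range(len(h_a) - 1):            # audio intra-chain terms
        e -= w_a[t] * h_a[t] * h_a[t + 1]
    for t in range(len(h_v) - 1):            # visual intra-chain terms
        e -= w_v[t] * h_v[t] * h_v[t + 1]
    ratio = len(h_a) // len(h_v)             # assumed integer time-scale ratio
    for t in range(len(h_v)):                # zipper couplings between chains
        e -= w_z[t] * h_v[t] * h_a[t * ratio]
    return e

def p_on(local_field):
    """Conditional probability that a binary unit is on, given the summed
    weighted input (local field) from its intra-chain and zipper neighbors."""
    return 1.0 / (1.0 + math.exp(-local_field))
```

In this sketch, cross-modal correlation enters only through the `w_z` terms: states in which the coupled audio and visual hidden units agree receive lower energy, so Gibbs sampling with `p_on` would favor mutually consistent interpretations of the two feature streams.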