This paper presents an orientation estimation scheme using a monocular camera and inertial measurement units (IMUs). Unlike traditional wearable orientation estimation methods, the proposed approach combines these two modalities in a novel way. First, two visual correspondences between consecutive frames are selected that satisfy both a descriptor similarity constraint and a locality constraint. The locality constraint assumes that a correspondence is an inlier if the counterparts of its nearest-neighbor feature points lie within predefined thresholds of the counterpart of the feature point under consideration. Second, the two selected visual correspondences and the quaternions from the inertial sensor are jointly used to derive the initial body poses. Third, a coarse-to-fine procedure iteratively removes false visual matches and estimates body poses using Expectation Maximization (EM). Finally, the optimal orientation estimate is obtained. Experimental results show that the proposed method is effective and well suited for wearable orientation estimation.
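The locality constraint can be illustrated with a minimal Python sketch. It follows one possible reading of the constraint: a match is kept only if the counterparts of its nearest neighbors in the previous frame stay close to its own counterpart in the current frame. The function name `locality_inliers`, the neighbor count `k`, the pixel threshold `tau`, and the sample coordinates are all hypothetical and are not taken from the paper.

```python
import math

def locality_inliers(pts_prev, pts_curr, k=2, tau=30.0):
    """Flag each match pts_prev[i] <-> pts_curr[i] as locally consistent or not.

    Assumed rule (an interpretation, not the paper's exact formulation):
    a match is an inlier if the counterparts of its k nearest neighbors
    in the previous frame all lie within tau of its own counterpart.
    """
    n = len(pts_prev)
    flags = []
    for i in range(n):
        # Rank the other matched points by distance in the previous frame.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(pts_prev[i], pts_prev[j]))
        neighbors = others[:k]
        # Check that the neighbors' counterparts stay close in the current frame.
        ok = all(math.dist(pts_curr[i], pts_curr[j]) <= tau for j in neighbors)
        flags.append(ok)
    return flags

# Three matches that move together, plus one match that lands far away.
pts_prev = [(0, 0), (10, 0), (0, 10), (100, 100)]
pts_curr = [(5, 5), (15, 5), (5, 15), (100, 400)]
print(locality_inliers(pts_prev, pts_curr, k=2, tau=30.0))
```

In this sketch a match can also be rejected because one of its neighbors is a false match, so the rule is conservative. That fits a step that needs only a small number of reliable correspondences.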