In this paper we present a novel approach to object recognition based on image localization and registration, applied to the problem of multi-modal interaction in smart-home environments. Such environments typically contain multiple small devices that need to be controlled from a distance. Thus, a major problem in recognizing a specific object is its small size in the image, compounded by typically cluttered backgrounds. We therefore recognize an intended object by first registering the acquired image within a panorama of the environment. An environment map is then used to identify potential objects within the user's field of view. Experimental results from using such a multi-modal input system in a running smart-home environment are presented, demonstrating the benefits of combining visual and verbal inputs.