One of the most fundamental issues for physical agents (humans, primates, and robots) performing various kinds of tasks is body representation. Neurophysiological evidence, especially from tool-use studies in monkeys, shows that this representation can be dynamically reconstructed through spatio-temporal integration of different sensory modalities, allowing it to adapt to environmental changes. To construct such a representation, however, a key issue must be resolved: determining which pieces of the various sensory data should be associated with one another. This paper presents a method that constructs a cross-modal body representation from vision, touch, and proprioception. Tactile sensation, triggered when the robot touches something, initiates the construction of the visual receptive field for body parts, which are found by visual attention based on a saliency map and consequently regarded as the end effector. Simultaneously, proprioceptive information is associated with this visual receptive field to achieve the cross-modal body representation. The proposed model is applied to a real robot, and results comparable to the activity of parietal neurons observed in monkeys are shown.
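The touch-triggered association described above can be sketched in simplified form: when contact is detected, the most salient visual region is treated as the end effector and linked to the current proprioceptive state by a Hebbian update. This is a minimal illustration only, assuming a toy intensity-contrast saliency map and illustrative sizes (an 8x8 visual field, a 2-DoF arm); it is not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def saliency_map(image):
    """Toy saliency: absolute deviation from the mean intensity.
    (A stand-in for the full saliency model assumed in the text.)"""
    return np.abs(image - image.mean())

# Hypothetical sizes: a coarse 8x8 visual field and a 2-DoF arm posture.
VIS, PROPRIO = 64, 2
W = np.zeros((VIS, PROPRIO))  # cross-modal associative weights

def update(image, joint_angles, touched, lr=0.1):
    """On touch, associate the currently salient visual cells (regarded as
    the end effector) with the proprioceptive state via a Hebbian rule."""
    global W
    if not touched:
        return None  # no tactile trigger, no learning
    s = saliency_map(image).ravel()
    attended = s >= np.sort(s)[-5]  # visual receptive field: top-5 salient cells
    W += lr * np.outer(attended.astype(float), joint_angles)
    return attended

# Simulated experience: a bright blob (the "hand") appears at a
# posture-dependent image location while the robot touches something.
for _ in range(200):
    q = rng.uniform(-1, 1, PROPRIO)
    img = rng.normal(0.0, 0.05, (8, 8))
    r, c = int(3.5 + 3 * q[0]), int(3.5 + 3 * q[1])
    img[r, c] += 1.0
    update(img, q, touched=True)
```

After training, each row of `W` encodes how strongly a visual cell is associated with the arm postures under which it was attended during touch, a crude analogue of the visuo-proprioceptive receptive fields discussed in the text.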