Seeing is not done through the eyes alone: it involves the integration of other modalities, such as auditory, proprioceptive, and tactile information, to locate objects, persons, and also the limbs. We hypothesize that the neural mechanism of gain-field modulation, which is known to perform coordinate transformations between modalities in the superior colliculus and in the parietal area, plays a key role in building such a unified perceptual world. In experiments with a head-neck-eyes robot equipped with a camera and microphones, we study how gain-field modulation in neural networks can serve to transcribe one modality's reference frame into another (e.g., audio signals into eye-centered coordinates). As a result, each modality influences the estimate of a stimulus's position (multimodal enhancement). This can be used, for example, to map sound signals into retinal coordinates for audio-visual speech perception.
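To illustrate the kind of transformation involved, the following minimal sketch (an illustrative assumption for exposition, not the paper's actual network; all names and parameters are hypothetical) shows how a population of units whose auditory responses are multiplicatively gain-modulated by eye position can remap a head-centered sound azimuth into eye-centered coordinates.

```python
import numpy as np

# Minimal gain-field sketch (illustrative assumption, not the authors' model).
n = 41                                  # units per population
prefs = np.linspace(-40, 40, n)         # preferred azimuths / eye positions (deg)
sigma = 8.0                             # tuning width (deg)

def tuning(x, prefs, sigma):
    """Gaussian population code for a scalar variable x."""
    return np.exp(-(x - prefs) ** 2 / (2 * sigma ** 2))

def gain_field(sound_azimuth_head, eye_position):
    """Each unit's auditory response is multiplicatively modulated
    by an eye-position gain, yielding a basis-function map."""
    audio = tuning(sound_azimuth_head, prefs, sigma)   # head-centered auditory input
    gain = tuning(eye_position, prefs, sigma)          # proprioceptive eye-position signal
    return np.outer(audio, gain)                       # gain-modulated population

def read_out_eye_centered(field):
    """Linear read-out: the field-weighted average of (auditory preference
    minus eye-position preference) recovers the eye-centered azimuth."""
    diffs = prefs[:, None] - prefs[None, :]
    return np.sum(field * diffs) / np.sum(field)

# A sound at +20 deg in head coordinates, with the eyes turned +10 deg,
# maps to roughly +10 deg in eye-centered (retinal) coordinates.
field = gain_field(20.0, 10.0)
print(read_out_eye_centered(field))     # ~10.0
```

The multiplicative combination is the essential point: because every unit carries both the sensory signal and the postural signal, a simple linear read-out of the population suffices to express the stimulus in the other modality's reference frame.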