In this paper, we propose an approach to bridge the gap between low-level image features and the human interpretation of an image. Taking a cue from text-based retrieval techniques, we construct "visual keywords" by vector quantization of small-sized image tiles. The visual and textual keywords are then combined to represent an image as a single multimodal vector, and we fuse the two modalities with a non-linear, diffusion-kernel-based approach. By comparing the performance of this approach with a low-level-features-based approach, we demonstrate that visual keywords, when combined with textual keywords, significantly improve image retrieval results.
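As a rough illustration of the visual-keyword idea, the sketch below tiles an image, quantizes the tiles with a plain k-means codebook, and concatenates the resulting tile-label histogram with a textual-keyword histogram to form a multimodal vector. The tile size, codebook size, and quantizer here are illustrative assumptions, not the paper's actual settings, and the diffusion-kernel fusion step is not shown:

```python
import numpy as np

def extract_tiles(image, tile=4):
    # Split a grayscale image into non-overlapping tile x tile patches,
    # each flattened into a (tile*tile)-dimensional feature vector.
    h, w = image.shape
    h, w = h - h % tile, w - w % tile
    patches = image[:h, :w].reshape(h // tile, tile, w // tile, tile)
    return patches.transpose(0, 2, 1, 3).reshape(-1, tile * tile)

def kmeans(X, k, iters=20, seed=0):
    # Plain k-means; the final centroids serve as the "visual keyword" codebook.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def multimodal_vector(image, text_hist, codebook, tile=4):
    # Quantize each tile to its nearest codeword, histogram the assignments
    # (the "visual keyword" frequencies), and concatenate with the
    # textual-keyword histogram to form one multimodal vector.
    tiles = extract_tiles(image, tile)
    labels = np.argmin(((tiles[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
    visual_hist = np.bincount(labels, minlength=len(codebook)).astype(float)
    visual_hist /= visual_hist.sum()
    return np.concatenate([visual_hist, text_hist])

# Hypothetical usage with a random 32x32 image and a 2-term text histogram.
rng = np.random.default_rng(1)
img = rng.random((32, 32))
codebook = kmeans(extract_tiles(img), k=8)
vec = multimodal_vector(img, np.array([0.5, 0.5]), codebook)
```

In this sketch the fused representation is a simple concatenation; the paper's contribution is to replace that linear combination with a diffusion-kernel-based non-linear fusion of the two modalities.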