This paper presents a fast and robust framework that integrates local features for matching multimodal geospatial data (e.g., optical, LiDAR, SAR, and map data). In the proposed framework, local feature descriptors, such as the Histogram of Oriented Gradients (HOG) and Local Self-Similarity (LSS), are first extracted at every pixel to form a pixel-wise structural feature representation of an image. We then define a similarity metric on this feature representation in the frequency domain using the three-dimensional fast Fourier transform (3DFFT), and apply a template matching scheme to detect control points between multimodal data. The framework rests on the hypothesis that structural similarity between images is preserved across different modalities. Its major advantages are (1) the representation of structural similarity through pixel-wise feature description and (2) high computational efficiency owing to the use of the 3DFFT. Experimental results on different types of multimodal geospatial data show that the proposed framework achieves more accurate matching than state-of-the-art methods.
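
The pipeline summarized above (pixel-wise features, a frequency-domain similarity, then template matching) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `pixelwise_features` and `match_template_fft` are hypothetical, the gradient-orientation channels are a simplified stand-in for HOG or LSS, folding orientations into [0, pi) is an added assumption to tolerate contrast reversal, and plain cross-correlation is assumed as the similarity measure.

```python
import numpy as np


def pixelwise_features(img, n_bins=8):
    """Return an (H, W, n_bins) feature cube (simplified stand-in for HOG/LSS).

    Each pixel stores its gradient magnitude in the channel of its
    gradient orientation; all other channels are zero.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Assumption: fold orientation into [0, pi) so that a contrast
    # reversal between modalities falls into the same channel.
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    feat = np.zeros(img.shape + (n_bins,))
    rows, cols = np.indices(img.shape)
    feat[rows, cols, bins] = mag
    return feat


def match_template_fft(search_img, template_img, n_bins=8):
    """Locate the template in the search image; return (row, col) offset.

    The feature cubes are cross-correlated with a 3-D FFT. The slice at
    zero shift along the channel axis equals the sum over channels of the
    2-D spatial correlation.
    """
    s = pixelwise_features(search_img, n_bins)
    t = pixelwise_features(template_img, n_bins)
    big_h, big_w = search_img.shape
    h, w = template_img.shape

    # Zero-pad the template cube to the size of the search cube.
    t_pad = np.zeros_like(s)
    t_pad[:h, :w, :] = t

    # Correlation theorem: corr = IFFT( FFT(s) * conj(FFT(t)) ).
    corr = np.fft.ifftn(np.fft.fftn(s) * np.conj(np.fft.fftn(t_pad))).real

    # Keep only shifts where the template lies fully inside the search
    # image, which excludes circular wrap-around.
    valid = corr[: big_h - h + 1, : big_w - w + 1, 0]
    row, col = np.unravel_index(np.argmax(valid), valid.shape)
    return int(row), int(col)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scene = rng.random((64, 64))
    # Simulate a second modality by inverting the intensities of a crop.
    template = 1.0 - scene[20:36, 25:41]
    print(match_template_fft(scene, template))
```

Because the similarity is evaluated for all candidate shifts in a single forward and inverse transform, the cost does not grow with the number of shifts tested, which is the efficiency argument made for the 3DFFT. A practical version would also need to handle noise, feature normalization, and subpixel refinement, none of which this sketch attempts.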