Matching pursuit adaptively decomposes signals in a redundant dictionary to achieve some sub-optimal non-orthogonal sparse representations. However, due to the redundancy of the dictionary, MP is usually very time consuming. FFT-based MP implementation runs significantly faster than greedy MP implementation, yet it still may take days to decompose an image on some dictionaries with high redundancy. This paper presents an implementation of FFT-based matching pursuit algorithm on CUDA platform for sparse decomposition of images. We found that FFT based MP presents strong data parallelism, thus suited for implementing on CUDA platform and executed in a parallel way on CUDA-capable GPU devices. Experiments results show that several dozen times of speedup ratio can be easily achieved.