Natural scene text recognition remains challenging because scene images are captured under unconstrained conditions. In this paper, we propose a method that first detects and recognizes characters using a high-performance Convolutional Neural Network (CNN). For post-processing, we employ an efficient and flexible Weighted Finite-State Transducer (WFST) based word labeling model, inspired by the success of WFSTs in speech recognition, to incorporate a lexicon or a high-order language model. Experiments show that the proposed approach correctly and robustly recognizes text in scene images, and results on several public datasets (ICDAR 2003, SVT and IIIT5K) show performance comparable or superior to state-of-the-art algorithms.
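To illustrate how a lexicon can constrain word labeling, the following is a minimal sketch and not the WFST model used in this paper. It assumes one character classifier output per character position (no segmentation ambiguity), represents the lexicon as a trie (a simple deterministic acceptor), and sums negative log-probabilities along each path, as in the tropical semiring. The names `build_trie`, `decode`, `char_probs` and `floor` are hypothetical and chosen only for this example.

```python
import math


def build_trie(lexicon):
    """Build a trie acceptor: each node is a state, each edge a character arc."""
    root = {"arcs": {}, "final": False}
    for word in lexicon:
        node = root
        for ch in word:
            node = node["arcs"].setdefault(ch, {"arcs": {}, "final": False})
        node["final"] = True
    return root


def decode(char_probs, lexicon, floor=1e-6):
    """Return (best_word, cost), minimising the summed negative log-probability.

    char_probs: list of dicts, one per character position, mapping
    character -> probability (stand-ins for CNN classifier outputs).
    floor: assumed smoothing value for characters the classifier did not score.
    """
    trie = build_trie(lexicon)
    # Each hypothesis is (accumulated cost, prefix, trie state).
    hyps = [(0.0, "", trie)]
    for probs in char_probs:
        next_hyps = []
        for cost, prefix, state in hyps:
            # Only follow arcs that exist in the lexicon trie.
            for ch, child in state["arcs"].items():
                p = max(probs.get(ch, floor), floor)
                next_hyps.append((cost - math.log(p), prefix + ch, child))
        hyps = next_hyps
    # Keep only hypotheses that end in a final state (a complete lexicon word).
    finals = [(cost, word) for cost, word, state in hyps if state["final"]]
    if not finals:
        return None, math.inf
    cost, word = min(finals)
    return word, cost


# Toy classifier outputs for a three-character word image.
char_probs = [
    {"c": 0.6, "e": 0.4},
    {"a": 0.6, "o": 0.4},
    {"l": 0.7, "t": 0.3},
]
best_word, best_cost = decode(char_probs, ["cat", "eat", "cot", "dog"])
```

In this toy input, picking the most probable character at each position independently would spell "cal", which is not in the lexicon; restricting the search to lexicon paths selects the best-scoring valid word instead. A full WFST formulation would additionally handle segmentation alternatives and language model weights through composition, which this sketch omits.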