The high variability of word pronunciation in spontaneous speech is one of the reasons for the low performance of current speech recognition systems. Dictionaries that take this variability into account may increase the robustness of such systems. A word pronunciation is a phoneme-like sequence that can appear in a real utterance and that represents a possible acoustic production of the word.
In this paper, word pronunciations are modeled using stochastic finite-state automata. Such models allow grammatical inference methods to be applied and can be integrated easily with the other knowledge sources. The training samples are obtained by aligning the phoneme-like decoding of each training utterance with the corresponding canonical transcription.
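As an illustration only, the following minimal sketch shows one way a stochastic finite-state model of a word's pronunciations could be estimated from training samples. It assumes that the alignment step has already produced, for each word, a list of observed phoneme-like sequences. It builds a prefix tree acceptor with relative-frequency probabilities, a common starting point for grammatical inference, and is not necessarily the inference method used in this work. The function names and the sample pronunciations are hypothetical.

```python
from collections import defaultdict


def build_stochastic_pta(samples):
    """Build a stochastic prefix tree acceptor from phoneme sequences.

    States are identified by the prefix read so far (a tuple of phones).
    Returns arc counts per state and the number of samples ending in each state.
    """
    arc_counts = defaultdict(lambda: defaultdict(int))
    final_counts = defaultdict(int)
    for sample in samples:
        state = ()
        for phone in sample:
            arc_counts[state][phone] += 1
            state = state + (phone,)
        final_counts[state] += 1
    return arc_counts, final_counts


def sequence_probability(model, sequence):
    """Probability that the model assigns to one pronunciation."""
    arc_counts, final_counts = model
    state = ()
    prob = 1.0
    for phone in sequence:
        arcs = arc_counts.get(state, {})
        total = sum(arcs.values()) + final_counts.get(state, 0)
        count = arcs.get(phone, 0)
        if count == 0:
            return 0.0  # unseen transition: no smoothing in this sketch
        prob *= count / total
        state = state + (phone,)
    arcs = arc_counts.get(state, {})
    total = sum(arcs.values()) + final_counts.get(state, 0)
    if total == 0:
        return 0.0
    # Multiply by the probability of stopping in the reached state.
    return prob * final_counts.get(state, 0) / total


# Hypothetical pronunciation samples of a single word (assumed, not from the paper).
samples = [
    ("p", "a", "r", "a"),
    ("p", "a", "r", "a"),
    ("p", "a", "a"),
    ("p", "r", "a"),
]
model = build_stochastic_pta(samples)
p_canonical = sequence_probability(model, ("p", "a", "r", "a"))
p_reduced = sequence_probability(model, ("p", "r", "a"))
p_unseen = sequence_probability(model, ("p", "a"))
```

In this sketch each observed pronunciation receives its relative frequency, and unseen sequences receive probability zero. A practical system would merge states and smooth the estimates so that the model generalizes beyond the training samples.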
The proposed models were applied to a translation-oriented speech task. The improvements achieved by these new models ranged from 0.6 to 2.7 points, depending on the language model used.