On the state definition for a trainable excitation model in HMM-based speech synthesis

R. Maia; T. Toda; K. Tokuda; S. Sakai; S. Nakamura

doi:10.1109/ICASSP.2008.4518522

Source

2008 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3965-3968

Abstract

One of the issues with speech synthesizers based on hidden Markov models (HMMs) is the vocoded quality of the synthesized speech. Based on the principle of analysis-by-synthesis speech coders, a trainable excitation model has been proposed to improve naturalness: the method designs a set of state-dependent filters so as to minimize the distortion between the residual and the synthetic excitation. Although this approach seems successful, how the states should be defined remains an open issue. This paper describes a method for state definition in which bottom-up clustering is performed on the leaves of full-context decision trees, using the likelihood of the residual database as the merging criterion. Experiments show that the resulting filter design improves residual modeling.
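The bottom-up clustering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the paper scores merges with the likelihood of the residual under state-dependent filters, whereas the sketch below substitutes a single 1-D Gaussian per cluster as a stand-in likelihood model. The names `gaussian_loglik`, `bottom_up_cluster`, `leaves`, and `target_states` are hypothetical, and the stopping rule (a fixed number of states) is an assumption rather than something stated in the abstract.

```python
import math
from itertools import combinations


def gaussian_loglik(samples):
    """Log-likelihood of samples under a 1-D Gaussian fit by maximum likelihood.

    Assumption: a Gaussian stands in for the paper's filter-based residual model.
    """
    n = len(samples)
    mean = sum(samples) / n
    sum_sq = sum((x - mean) ** 2 for x in samples)
    var = max(sum_sq / n, 1e-8)  # floor the variance to avoid log(0)
    return -0.5 * n * math.log(2.0 * math.pi * var) - 0.5 * sum_sq / var


def bottom_up_cluster(leaves, target_states):
    """Greedily merge clusters until only target_states remain.

    leaves: list of lists of floats, one list per decision-tree leaf
            (hypothetical stand-in for the residual segments tied to a leaf).
    At each step, the pair whose merge loses the least log-likelihood is merged.
    """
    clusters = [list(leaf) for leaf in leaves]
    while len(clusters) > target_states:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = clusters[i] + clusters[j]
            # Likelihood lost by modeling both clusters with one shared model.
            loss = (gaussian_loglik(clusters[i])
                    + gaussian_loglik(clusters[j])
                    - gaussian_loglik(merged))
            if best is None or loss < best[0]:
                best = (loss, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return clusters


if __name__ == "__main__":
    toy_leaves = [
        [0.0, 0.1, -0.1],
        [0.05, -0.05, 0.0],
        [5.0, 5.1, 4.9],
        [5.05, 4.95, 5.0],
    ]
    for state in bottom_up_cluster(toy_leaves, target_states=2):
        print(sorted(state))
```

On the toy data, the two leaves near 0 and the two leaves near 5 end up in separate states, since merging across the groups would cost far more likelihood than merging within them. The pairwise search is quadratic in the number of clusters per step, which is acceptable for a sketch but would likely need a cheaper candidate search on a real full-context tree.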