Generative Modeling of Voice Fundamental Frequency Contours

Hirokazu Kameoka; Kota Yoshizato; Tatsuma Ishihara; Kento Kadowaki; Yasunori Ohishi; Kunio Kashino

doi:10.1109/TASLP.2015.2418576

Source

IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 1042-1053, 2015

Abstract

This paper introduces a generative model of voice fundamental frequency ($F_0$) contours that allows us to extract prosodic features from raw speech data. The present $F_0$ contour model is formulated by translating the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration, into a probabilistic model described as a discrete-time stochastic process. There are two motivations behind this formulation. One is to derive a general parameter estimation framework for the Fujisaki model that allows the introduction of powerful statistical methods. The other is to construct an automatically trainable version of the Fujisaki model that we can incorporate into statistical-model-based text-to-speech synthesizers in such a way that the Fujisaki-model parameters can be learned from a speech corpus in a unified manner. It could also be useful for other speech applications such as emotion recognition, speaker identification, speech conversion and dialogue systems, in which prosodic information plays a significant role. We quantitatively evaluated the performance of the proposed Fujisaki model parameter extractor using real speech data. Experimental results revealed that our method was superior to a state-of-the-art Fujisaki model parameter extractor.
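
For readers unfamiliar with the Fujisaki model that the abstract builds on, the following is a minimal sketch of its classic deterministic form, sampled on a discrete time grid: the log $F_0$ contour is a baseline plus phrase components (impulse responses of a critically damped second-order filter) plus accent components (step responses of a second filter, clipped at a ceiling). It does not reproduce the probabilistic formulation or the parameter estimation method proposed in the paper. The function names, the command timings and magnitudes, and the constants `alpha = 3.0`, `beta = 20.0` and `gamma = 0.9` are illustrative assumptions, not values taken from the paper.

```python
import math

def phrase_response(t, alpha=3.0):
    """Phrase-control impulse response: alpha^2 * t * exp(-alpha * t) for t >= 0."""
    if t < 0.0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)

def accent_response(t, beta=20.0, gamma=0.9):
    """Accent-control step response, clipped at the ceiling gamma, for t >= 0."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_log_f0(t, base_hz, phrase_cmds, accent_cmds):
    """Log F0 at time t (seconds) under the classic Fujisaki model.

    phrase_cmds: list of (onset, magnitude) impulses.
    accent_cmds: list of (onset, offset, amplitude) rectangular pulses.
    """
    value = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        # A rectangular command is the difference of two step responses.
        value += amplitude * (accent_response(t - onset) - accent_response(t - offset))
    return value

# Hypothetical commands for a 3-second utterance, sampled every 10 ms.
times = [0.01 * n for n in range(300)]
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.8, 0.4), (1.2, 1.6, 0.3)]
f0_hz = [math.exp(fujisaki_log_f0(t, 100.0, phrase_cmds, accent_cmds)) for t in times]
```

Parameter extraction, the task evaluated in the abstract, is the inverse problem: given an observed contour such as `f0_hz`, recover the phrase and accent commands that produced it.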