Automatic phonetic reconstruction of medical dictations from non-literal and automatically recognized speech transcripts leads to closer-to-literal transcripts for training. In this paper, we introduce an extended alignment method assessing multiple levels of text segmentation and show how open issues such as wrong segmentation in the recognized transcript can be resolved. Furthermore, the effect of...
This paper proposes a speech segment selection method based on machine learning for concatenative speech synthesis systems. The proposed method has two novel features. One is its use of a support vector machine (SVM) to estimate the subjective correctness of the pitch accent with respect to each accent phrase of possible candidate speech segments. The other is its use of a determination function to identify...
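The abstract above is truncated, but the core idea it names — scoring candidate segments' pitch-accent correctness with an SVM — can be sketched generically. The feature set (here, two illustrative F0 statistics per accent phrase), the synthetic labels, and the ranking step are assumptions for illustration, not the paper's actual design:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: one row per accent phrase, features = [mean F0 (Hz),
# F0 range (Hz)]; labels: 1 = pitch accent judged correct, 0 = incorrect.
# Features and labels are synthetic, purely for illustration.
X_train = np.array([[120.0, 40.0], [125.0, 45.0], [130.0, 50.0],
                    [120.0,  5.0], [125.0,  8.0], [130.0,  6.0]])
y_train = np.array([1, 1, 1, 0, 0, 0])

svm = SVC(kernel="rbf").fit(X_train, y_train)

def rank_candidates(candidates):
    """Return candidate indices ordered by the SVM margin for 'correct'."""
    scores = svm.decision_function(candidates)
    return np.argsort(-scores)

# Candidate index 1 has the larger F0 range, closer to the 'correct' class.
order = rank_candidates(np.array([[122.0, 7.0], [124.0, 44.0]]))
```

Using the decision margin rather than a hard label lets the synthesizer rank all candidates within an accent phrase instead of merely filtering them.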
This paper describes a novel approach to the context clustering process in speaker-independent HMM-based Thai speech synthesis, aimed at improving the tone intelligibility of both the average voice and the speaker-adapted voice. A couple of phrase intonation features from a generative model, including a baseline value of the fundamental frequency and a phrase command amplitude, are extracted and thereafter...
In the literature, many intonation models are trained using parameters extracted sentence by sentence on contours interpolated in the unvoiced segments. This may bias the final parameters and reduce the generalization of the model due to their increased dispersion. Recently, we proposed JEMA, a joint extraction and prediction approach for intonation modeling that avoids...
One of the biggest challenges in emotional speech resynthesis is the selection of modification parameters that will make humans perceive a targeted emotion. The best selection method is to use human raters; however, for large evaluation sets this process can be very costly. In this paper, we describe a recognition-for-synthesis (RFS) system to automatically select a set of possible parameter values...
Corpus-based concatenative speech synthesis is very popular these days due to its highly natural speech quality. The amount of computation required at run time, however, is often quite large, and various approaches have been proposed to reduce it. In this paper, we propose early stopping schemes for the Viterbi beam search in unit selection, with which we can stop early...
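The abstract is cut off before it describes the paper's actual stopping schemes, but the general setting — a beam-pruned Viterbi search over candidate units where hypothesis evaluation can be aborted early — can be sketched. The cost functions, the pruning bound, and the non-negative join-cost assumption below are illustrative, not the paper's method:

```python
import math

def viterbi_beam(candidates, target_cost, join_cost, beam_width=3):
    """Beam-pruned Viterbi search over per-position candidate unit lists.

    Keeps at most `beam_width` partial paths per position and aborts the
    evaluation of a hypothesis as soon as its running cost can no longer
    beat the worst path already kept (join costs assumed non-negative).
    """
    beam = sorted((target_cost(0, u), [u]) for u in candidates[0])[:beam_width]
    for t in range(1, len(candidates)):
        hyps = []
        bound = math.inf  # cost of the worst hypothesis currently kept
        for u in candidates[t]:
            tc = target_cost(t, u)
            for cost, path in beam:
                partial = cost + tc
                if partial >= bound:
                    continue  # early stop: adding join cost cannot help
                total = partial + join_cost(path[-1], u)
                if total < bound or len(hyps) < beam_width:
                    hyps.append((total, path + [u]))
                    hyps.sort()
                    del hyps[beam_width:]
                    if len(hyps) == beam_width:
                        bound = hyps[-1][0]
        beam = hyps
    return min(beam)

# Toy lattice: two positions, two candidate units each; the search should
# pick units [1, 2], which match the targets exactly with a small join cost.
targets = [1, 2]
cost, path = viterbi_beam(
    [[1, 3], [2, 4]],
    target_cost=lambda t, u: abs(u - targets[t]),
    join_cost=lambda a, b: 0.1 * abs(a - b),
)
```

The `partial >= bound` check is the early stop: since join costs only add, a hypothesis already over the bound is discarded without computing them at all.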
Due to the inconsistency between maximum likelihood (ML) based training and the synthesis application in HMM-based speech synthesis, a minimum generation error (MGE) criterion has been proposed for HMM training. This paper continues by applying the MGE criterion to model adaptation for HMM-based speech synthesis. We propose an MGE linear regression (MGELR) based model adaptation algorithm, where the...
Voice conversion has become more and more important in speech technology, but most current approaches have to use parallel utterances of both the source and target speaker as the training corpus, which limits the application of the technology. In this paper, we propose a new method of text-independent voice conversion which uses a non-parallel corpus for training. The hidden Markov model (HMM) is used...
This paper presents a new method for reducing an existing speech database so that it can be used for domain-independent embedded unit selection text-to-speech synthesis. The method relies on statistical data produced by the unit selection process on a large text corpus. It utilizes the selection frequency of each unit as well as its actual score. Both objective and subjective evaluation of the...
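The pruning idea the abstract describes — keep only the units that unit selection actually chooses, weighted by how well they score — can be sketched generically. The thresholds and the way frequency and score are combined below are illustrative assumptions, not the paper's actual rule:

```python
from collections import defaultdict

def prune_units(selection_log, min_count=2, min_avg_score=0.5):
    """Keep units selected often enough and with a high enough mean score.

    selection_log: iterable of (unit_id, score) pairs logged while running
    unit selection over a large text corpus.
    """
    counts = defaultdict(int)
    totals = defaultdict(float)
    for unit, score in selection_log:
        counts[unit] += 1
        totals[unit] += score
    return {u for u in counts
            if counts[u] >= min_count and totals[u] / counts[u] >= min_avg_score}

# "a" is selected often with good scores; "b" is rare; "c" scores poorly.
kept = prune_units([("a", 0.9), ("a", 0.8), ("b", 0.9), ("c", 0.2), ("c", 0.3)])
```

Units that are never (or rarely and poorly) selected on a representative corpus are the natural candidates for removal in an embedded, storage-constrained system.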
This paper investigates the use of sequential clustering for speaker diarization. Conventional diarization systems are based on parametric models and agglomerative clustering. In our previous work we proposed a non-parametric method based on the agglomerative information bottleneck for very fast diarization. Here we consider the combination of sequential and agglomerative clustering for avoiding...
This paper presents a method for modeling the envelope of spectral amplitude parameters of speech signals in "two dimensions" (2D). It consists of two cascaded modeling stages: the first, along the frequency axis, is the usual cepstrum technique, which models the log-scaled spectral envelope with a discrete cosine model (DCM). The second, along the time axis, consists of modeling...
In this work we consider the problem of spectral envelope estimation using spectra with a perceptually warped frequency axis. The goal of this work is the reduction of the order of the spectral envelope model, which will facilitate the use of these envelopes for training voice conversion systems. We adapt the true-envelope estimator to Mel-frequency representations and adapt a recently proposed cepstral...
With the development of voice transformation and speech synthesis technologies, speaker identification systems are likely to face attacks from impostors who use voice-transformed or synthesized speech to mimic a particular speaker. In this paper, we therefore investigate how speaker identification systems perform on voice-transformed speech. We conducted experiments with two different approaches,...
In this paper we argue that context information can be used in early stages, i.e., during the definition of the mapping of words into grapheme sequences. We show that early tagged contextual graphemes play a significant role in improving the performance of grapheme-based speech synthesis and speech recognition systems.
The work presented here shows a comparison between a voice conversion system based on converting only the vocal tract representation of the source speaker and an augmented system that adds an algorithm for estimating the target excitation signal. The estimation algorithm uses a stochastic model for relating the excitation signal to the vocal tract features. The two systems were subjected to objective...
In current voice conversion systems, obtaining a high similarity between converted and target voices requires a high degree of signal manipulation, which implies significant quality degradation, to the point that in some cases the quality scores are unacceptable for real-life applications. Indeed, a tradeoff can be observed between the similarity scores and the quality scores achieved by a given...
The goal of this study was to evaluate the synthesis of visible speech based on 3-D motion data using second-order isomorphism. To do this, word stimuli were generated for perceptual discrimination and identification tasks. Discrimination trials were based on word pairs predicted to lie at four levels of perceptual dissimilarity. Results from the discrimination tasks indicated...
This paper presents a minimum unit selection error (MUSE) training method for an HMM-based unit selection speech synthesis system, which selects the optimal phone-sized unit sequence from the speech database by maximizing the combined likelihood of a group of trained HMMs. Under the MUSE criterion, the weights and distribution parameters of these HMMs are estimated to minimize the number of different units...
A new statistical confidence measure, the template constrained posterior (TCP), is proposed for verifying phone transcriptions of speech databases. Unlike the generalized posterior probability (GPP), TCP is computed by considering string hypotheses that bear a focused unit, e.g., a phone with partially matched left and right contexts. Parameters used for TCP include the context window length, partial matching...