Outlined in this paper is a novel approach to speech dereverberation when an estimate of the source-receiver transfer function is known. It is a two-stage algorithm based on the minimum phase/allpass decomposition of a mixed phase room impulse response (RIR). The reverberant speech is first filtered with the inverse minimum phase component of the RIR. Then a non-negative matrix factorization (NMF)...
In this paper, we propose to use discriminative training (DT) for improving letter-to-sound (LTS) conversion performance. LTS is a critical component in both ASR and TTS for predicting the correct pronunciation of a word not included in the lexicon. For TTS applications, predicting the proper pronunciation of an out-of-vocabulary person/place name, especially a name with foreign origin can be challenging...
We present a framework for speech recognition that accounts for hidden articulatory information. We model the articulatory space using a codebook of articulatory configurations geometrically derived from EMA measurements available in the MOCHA database. The articulatory parameter set we derive is in the form of Maeda parameters. In turn, these parameters are used in a physiologically-motivated articulatory...
Fundamental frequency contours for speech, as obtained by common pitch tracking algorithms, contain a great deal of fine detail that is unlikely to hold much perceptual significance for listeners. In our experiments, a radically reduced pitch contour consisting of a single linear segment for each syllable was found to be judged as equally natural as the original pitch track by listeners, based on high-quality...
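The reduction described in the abstract above, one linear segment per syllable, can be sketched as follows. This is only an illustration under stated assumptions: the syllable boundaries are given as (start, end) times, each segment is an ordinary least-squares line fit, and the names `fit_line` and `stylize_f0` are hypothetical. The abstract does not say how the authors obtained their segments.

```python
def fit_line(times, values):
    """Least-squares line through (time, value) points; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0.0:
        # A single time point (or repeated times): fall back to a flat segment.
        return 0.0, mean_v
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def stylize_f0(times, f0, syllables):
    """Replace the F0 samples inside each (start, end) syllable by one fitted line.

    Assumption: least-squares fitting; samples outside every syllable are kept.
    """
    stylized = list(f0)
    for start, end in syllables:
        idx = [i for i, t in enumerate(times) if start <= t < end]
        if not idx:
            continue
        slope, intercept = fit_line([times[i] for i in idx], [f0[i] for i in idx])
        for i in idx:
            stylized[i] = slope * times[i] + intercept
    return stylized
```

A contour that is already piecewise linear within the syllable boundaries is returned unchanged, while fine detail inside a syllable is smoothed onto the fitted line.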
In this paper, the high-level prosodic patterns of prosodic word (PW), prosodic phrase (PPh) and breath group/prosodic phrase group (BQ/PQ) for syllable pitch-level and duration are explored using an automatic joint prosody labeling and modeling method. Experimental results on a treebank speech corpus showed that the explored high-level prosodic patterns not only matched well with our a priori knowledge...
The level of quality that can be achieved in concatenative text-to-speech synthesis is primarily governed by the inventory of units used in unit selection. This has led to the collection of ever larger corpora in the quest for ever more natural synthetic speech. As operational considerations limit the size of the unit inventory, however, pruning is critical to removing any instances that prove either...
Phoneme segmentation is a fundamental problem in many speech recognition and synthesis studies. Unsupervised phoneme segmentation assumes no knowledge of linguistic content or acoustic models, and thus poses a challenging problem. The essential question here is what the optimal segmentation is. This paper formulates the optimal segmentation problem in a probabilistic framework. Using statistics...
A novel method of unit database pruning for concatenative speech synthesis is proposed. The proposed method uses sums of the unit preference criterion, which are calculated from cost degradation from the optimal sequence, instead of the appearance frequencies of units, which are used in the conventional method. Therefore, the proposed method is an extension of the conventional method. Since not only...
We present a new algorithm for the automatic estimation of the voicing cut-off frequency (VCO), i.e., the frequency that separates the periodic low-frequency part from the aperiodic high-frequency part in voiced segments of natural speech. Starting from the power spectrum of a two pitch period speech frame, we define the VCO to be located at the frequency for which the sum of the periodic and aperiodic...
Recently, research related to multi-lingual and cross-lingual speech has gained increasing popularity. One of the major problems when dealing with multi-lingual speech data is the mapping of the phone sets between different languages. Phone mapping is useful for cross-lingual speech recognition, cross-lingual pronunciation modelling and mixed language speech synthesis, to name a few. In this paper,...
We propose a technique for synthesizing speech with desired style expressivity of an arbitrary target speaker's voice. In an MLLR-based speaker adaptation technique for multiple regression hidden semi-Markov model (MRHSMM), the quality of synthesized speech crucially depends on the initial MRHSMM trained from a certain source speaker's data and it is not always possible to synthesize natural sounding...
Harmonic + noise model (HNM) is a hybrid model of speech with a harmonic component and a noise component. While the harmonic part describes efficiently the periodicities in speech signals (voiced parts), modeling of the noise part introduces artifacts primarily because of the specific time-domain characteristics of noise in voiced speech. In this paper, we concentrated on the modeling of noise in...
This paper describes a speaker-independent/adaptive HMM-based speech synthesis system developed for the Blizzard Challenge 2007. The new system, named "HTS-2007", employs speaker adaptation (CSMAPLR+MAP), feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in our...
In this contribution, a time-varying linear prediction is proposed for speech analysis and synthesis. In comparison to the time-invariant prediction, the predictor coefficients are time-varying within the frames. For that purpose, the coefficient trajectories can be described by basis functions. This approach leads to discontinuities between the frames if the frames are analyzed independently. Therefore,...
A simple new method for estimating temporally stable power spectra is introduced to provide a unified basis for computing an interference-free spectrum, the fundamental frequency (F0), as well as aperiodicity estimation. F0 adaptive spectral smoothing and cepstral liftering based on consistent sampling theory are employed for interference-free spectral estimation. A perturbation spectrum, calculated...
Homograph disambiguation is the core issue of grapheme-to-phoneme conversion in Mandarin text-to-speech systems. In this paper, a hybrid algorithm called tree-guided transformation-based learning (TTBL), which combines decision trees with transformation-based learning (TBL), is proposed to resolve homograph ambiguity. It can automatically generate templates, thereby avoiding manually summarizing...
We propose a cross-language state mapping approach to HMM-based bilingual TTS. Two language-dependent decision trees are built first with a bilingual speech database recorded by a single speaker. A state mapping for every leaf node in the decision tree of a target language is created by finding the nearest leaf node in the tree of a source language. Kullback-Leibler divergence between two distributions...
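The nearest-leaf mapping mentioned in the abstract above can be illustrated with a small sketch. Several things here are assumptions, since the abstract is truncated: each leaf is modelled as a single diagonal-covariance Gaussian, the divergence is the plain (asymmetric) KL from the target leaf to the source leaf, and the names `kl_diag_gauss` and `map_states` are hypothetical.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariance (closed form)."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def map_states(target_leaves, source_leaves):
    """Map every target-language leaf to the source-language leaf with minimum KL.

    Both arguments are dicts: leaf name -> (mean vector, variance vector).
    """
    mapping = {}
    for t_name, (t_mean, t_var) in target_leaves.items():
        mapping[t_name] = min(
            source_leaves,
            key=lambda s: kl_diag_gauss(t_mean, t_var, *source_leaves[s]),
        )
    return mapping
```

For example, with target leaves centred at 0.0 and 5.0 and source leaves centred at 0.2 and 4.5 (all unit variance), each target leaf maps to the closer source leaf. A symmetric divergence could be substituted by summing the KL in both directions.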
The enhancement of short-term spectra of noisy speech can be achieved by statistical estimation of the clean speech spectral components. We present a minimum mean-square error estimator of the clean speech spectral magnitude that uses both a parametric compression function in the estimation error criterion and a parametric prior distribution for the statistical model of the clean speech magnitude...
One of the issues of speech synthesizers based on hidden Markov models concerns the vocoded quality of the synthesized speech. From the principle of analysis-by-synthesis speech coders a trainable excitation model has been proposed to improve naturalness, where the method consists in the design of a set of state-dependent filters in a way to minimize the distortion between residual and synthetic excitation...
The discrete cosine transform is proposed as a basis for representing fundamental frequency (F0) contours of speech. The advantages over existing representations include deterministic algorithms for both analysis and synthesis and a simple distance measure in the parameter space. A two-tier model using the DCT is shown to be able to model F0 contours to around 10 Hz RMS error. A proof-of-concept system...
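The properties claimed in the abstract above (deterministic analysis and synthesis, and a simple distance in parameter space) can be illustrated with a minimal sketch. It assumes an unnormalised DCT-II over a uniformly sampled F0 contour and a Euclidean distance between coefficient vectors; the two-tier model of the paper is not reproduced, and the function names are hypothetical.

```python
import math


def dct2(x):
    """Unnormalised DCT-II: c[j] = sum_k x[k] * cos(pi * (k + 0.5) * j / n)."""
    n = len(x)
    return [
        sum(x[k] * math.cos(math.pi * (k + 0.5) * j / n) for k in range(n))
        for j in range(n)
    ]


def idct2(coeffs, n):
    """Invert dct2 for a contour of length n.

    Passing only the first few coefficients yields a smoothed contour.
    """
    return [
        coeffs[0] / n
        + (2.0 / n)
        * sum(
            coeffs[j] * math.cos(math.pi * (k + 0.5) * j / n)
            for j in range(1, len(coeffs))
        )
        for k in range(n)
    ]


def coeff_distance(a, b):
    """Euclidean distance between two equally long coefficient vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def rms_error(x, y):
    """Root-mean-square difference between two contours (in Hz for F0 in Hz)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(x, y)) / len(x))
```

Analysis is `dct2(contour)`, synthesis is `idct2(coeffs[:m], len(contour))` for some small `m`, and `rms_error` measures how much detail the truncation discards.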