The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Recently, a novel speaker adaptation method was proposed that applied the Speaker Adaptive Training (SAT) concept to a speech recognizer consisting of a Deep Neural Network (DNN) and a Hidden Markov Model (HMM), and its utility was demonstrated. This method implements the SAT scheme by allocating one Speaker Dependent (SD) module for each training speaker to one of the intermediate layers of the front-end...
Expressive intonation makes focal prominence to give emphases that highlight the focus of speech. This paper describes a method for improving the expressiveness of HMM-based voices, particularly putting focal prominence on a word. Different from previous methods, our method exploits a speech corpus available for model training, without needing to record additional emphasis speech. This method employs...
Robot utterances generally sound monotonous, unnatural, and unfriendly because their Text-to-Speech (TTS) systems are not optimized for communication but for text-reading. Here we present a non-monologue speech synthesis for robots. We collected a speech corpus in a non-monologue style in which two professional voice talents read scripted dialogues. Hidden Markov models (HMMs) were then trained with...
This study presents an overview of VoiceTra, which was developed by NICT and released as the world'fs first network-based multilingual speech-to-speech translation system for smartphones, and describes in detail its multilingual speech recognition, its multilingual translation, and its multilingual speech synthesis in regards to field experiments. We show the effects of system updates using the data...
We proposed the WFSTDM which is an expandable and adaptable dialogue management platform. The WFSTDM combines various WFSTs and enables us to develop new dialogue management WFSTs necessary for rapid prototyping of spoken dialogue systems. In this paper, we illustrate the outline of the WFSTDM and introduce the WFSTDM builder, a network-based spoken dialogue system development tool. In addition, we...
Noise reduction algorithms are widely used to mitigate noise effects on speech to improve the robustness of speech technology applications. However, they inevitably cause speech distortion. The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. This study proposes a novel framework for noise reduction by considering this tradeoff. We regard...
The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. We have proposed a regularization framework for noise reduction with the consideration of the tradeoff problem. We regard speech estimation as a functional approximation problem in a reproducing kernel Hilbert space (RKHS). In the estimation, the objective function is formulated to...
Ensemble acoustic modeling can be used to model different factors that cause variability of acoustic space, and provide different combination to improve the performance of automatic speech recognition (ASR). One of the main concerns is how to partition the training data set to several subsets based on which ensemble models are trained. In this study, we focus on ensemble acoustic modeling concerned...
In this paper, we present our work on collecting spontaneous texts from the Web for constructing a language model in a Chinese speech recognition system. The selection of spontaneous-like texts involves two steps: First, word-segmented web texts are selected using a perplexity-based approach in which the style-related words are strengthened by omitting infrequent topic words from similarity measurements...
Use of a linear projection (LP) function to transform multiple sets of acoustic models into a single set of acoustic models is proposed for characterizing testing environments for robust automatic speech recognition. The LP function is an extension of the linear regression (LR) function used in maximum likelihood linear regression (MLLR) and maximum a posteriori linear regression (MAPLR) by incorporating...
In this paper, we propose an integration process of feature compensation and selection on the collective acoustic feature sets to derive a set of advanced acoustic features for speaker state recognition. For feature normalization, we perform a two-dimensional histogram equalization (2-D HEQ) normalization to reduce variability of speaker and speaking environment factors. For feature selection, we...
This paper reports an automatic speech summarization method and experimental results using English broadcast news speech. In our proposed method, a set of words maximizing a summarization score indicating an appropriateness of summarization is extracted from automatically transcribed speech. This extraction is performed using a Dynamic Programming (DP) technique according to a target compression ratio...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.