Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications, including acoustic modeling for statistical parametric speech synthesis. One of the concerns in applying them to text-to-speech applications is their effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs...
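The truncated abstract does not show the paper's architecture, but the key property of a unidirectional LSTM for streaming is that each frame's output depends only on past frames. A minimal numpy sketch of that frame-by-frame recurrence (sizes and weights are hypothetical, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 8, 16   # hypothetical linguistic-feature / hidden sizes

# Gate weights for [input, forget, output, candidate] gates, stacked.
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c):
    """One unidirectional LSTM step: it reads only the current input and
    the previous state, so output for frame t never waits on frame t+1."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

# Stream frames one at a time, emitting an output per frame.
h, c = np.zeros(n_hid), np.zeros(n_hid)
outputs = []
for t in range(5):
    x_t = rng.standard_normal(n_in)   # stand-in features for frame t
    h, c = lstm_step(x_t, h, c)
    outputs.append(h.copy())
```

A bidirectional LSTM, by contrast, needs the whole utterance before producing any frame, which is why the unidirectional variant is the natural fit for low-latency synthesis.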
In DNN-based TTS synthesis, a DNN's hidden layers can be viewed as a deep transformation of linguistic features, and the output layer as a representation of the acoustic space that regresses the transformed linguistic features to acoustic parameters. The deep-layered architecture of DNNs can not only represent highly complex transformations compactly, but also take advantage of huge amounts of training data. In this...
This paper presents a Subspace Gaussian Mixture Model (SGMM) approach employed as a probabilistic generative model to estimate speaker vector representations, which are subsequently used in the speaker verification task. SGMMs have already been shown to significantly outperform traditional HMM/GMMs in Automatic Speech Recognition (ASR) applications. An extension to the basic SGMM framework makes it possible to robustly...
A complete emotional expression typically follows a complex temporal course in natural conversation. Related research on utterance-level and segment-level processing lacks an understanding of the underlying structure of emotional speech. In this study, a hierarchical affective structure of an emotional utterance, characterized by probabilistic context-free grammars (PCFGs), is proposed for emotion...
The paper presents a method for converting word-based automatic speech recognition (ASR) lattices into word-semantic (W-SE) lattices that contain the original words together with partial semantic information, so-called semantic entities. The semantic entity detection algorithm generates semantic entities based on expert-defined knowledge. The generated W-SE lattices have a smaller vocabulary and consequently...
Hidden Markov Models (HMMs) are powerful statistical techniques with many applications, and in this paper they are used for modeling asymmetric threats. The observations generated by such HMMs are generally interspersed with clutter observations that are unrelated to the HMM. In this paper a Bernoulli filter is proposed, which processes cluttered observations and is capable of detecting if there is an HMM...
This paper addresses reverberant speech recognition based on front-end processing using a DAE (Deep AutoEncoder) coupled with a DNN (Deep Neural Network) acoustic model. A DAE can effectively and flexibly learn a mapping from corrupted speech to the original clean speech based on the deep learning scheme. While this mapping is conventionally performed using only the acoustic information, we presume the mapping...
This paper investigates modeling nonlinear transformations based on deep neural networks (DNNs). Specifically, a DNN is used as a nonlinear mapping function for feature space transformation for HMM acoustic models. The nonlinear transformations are estimated under the sequence-based maximum likelihood criterion. The likelihood partition function is evaluated using the Monte Carlo method based on importance...
Although context-dependent DNN-HMM systems have achieved significant improvements over GMM-HMM systems, there is still a large performance degradation when the acoustic condition of the test data mismatches that of the training data. Hence, adaptation and adaptive training of DNNs are of great research interest. Previous works mainly focus on adapting the parameters of a single DNN by regularized or...
To develop speaker adaptation algorithms for deep neural networks (DNNs) that are suitable for large-scale online deployment, it is desirable that the adaptation model be represented in a compact form and learned in an unsupervised fashion. In this paper, we propose a novel low-footprint adaptation technique for DNNs that adapts the model through node activation functions. The approach introduces...
This paper presents a novel interactive method for recognizing handwritten words, using the inertial sensor data available on smart watches. The goal is to allow the user to write with a finger, and use the smart watch sensor signals to infer what the user has written. Past work has exploited the similarity of handwriting recognition to speech recognition in order to deploy HMM-based methods. In contrast...
To accomplish effective communication, interaction partners generally adapt their verbal and non-verbal behavior to that of their interlocutors. This behavior adaptation is often modulated by the underlying emotional states of partners. Modeling such mutual behavioral influence is critical for emotion characterization in an interaction. In this paper, we focus on explicitly modeling the mutual influence...
In this paper we introduce a novel non-blind speech enhancement procedure based on visual speech recognition (VSR). The latter is based on a generative process that analyzes sequences of talking faces and classifies them into visual speech units known as visemes. We use an effective graphical model capable of segmenting and labeling a given sequence of talking faces into a sequence of visemes. Our model captures...
Dropout and DropConnect can be viewed as regularization methods for deep neural network (DNN) training. In DNN acoustic modeling, the huge number of speech samples makes it expensive to sample the neuron mask (Dropout) or the weight mask (DropConnect) repetitively from a high-dimensional distribution. In this paper we investigate the effect of Gaussian stochastic neurons on DNN acoustic modeling....
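The truncated abstract does not give the paper's exact formulation; a common way to avoid repeated Bernoulli mask draws is to replace the mask with Gaussian multiplicative noise whose mean and variance match the scaled mask. A minimal numpy sketch of that idea (all sizes and the keep probability are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.8                          # assumed probability of keeping a neuron
h = rng.standard_normal((4, 256))     # stand-in hidden-layer activations

# Standard (inverted) dropout: sample a Bernoulli mask, rescale by 1/p_keep.
mask = rng.binomial(1, p_keep, size=h.shape) / p_keep
h_dropout = h * mask

# Gaussian multiplicative noise with the same mean (1) and variance
# ((1 - p) / p) as the scaled Bernoulli mask, so per-neuron noise
# statistics match dropout without drawing a discrete mask.
sigma = np.sqrt((1 - p_keep) / p_keep)
noise = rng.normal(1.0, sigma, size=h.shape)
h_gauss = h * noise
```

The scaled Bernoulli mask has mean 1 and variance (1 - p)/p, so the Gaussian noise is moment-matched to it; whether this is exactly the construction the paper studies cannot be verified from the snippet.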
Due to the large number of parameters in deep neural networks (DNNs), it is challenging to design a small-footprint DNN-based speech recognition system while maintaining high recognition performance. Even with a singular value decomposition (SVD) method and scalar quantization, the DNN model is still too large to be deployed on many mobile devices. Common practices like reducing the number...
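The SVD compression mentioned above replaces a weight matrix with a low-rank factorization, trading a small reconstruction error for a large drop in parameter count. A minimal numpy sketch, with a hypothetical 1024x1024 layer and rank 128 chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((1024, 1024))  # hypothetical fully-connected layer

# Keep the top-k singular values: W ~ U_k @ V_k.
k = 128
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_k = U[:, :k] * s[:k]        # 1024 x k (singular values folded in)
V_k = Vt[:k, :]               # k x 1024

# Parameter count drops from 1024*1024 to 2*1024*k.
orig_params = W.size                       # 1048576
compressed_params = U_k.size + V_k.size    # 262144
```

In practice the layer is then implemented as two smaller matrix multiplies (`x @ U_k` then `@ V_k`), which is exactly why further quantization is still needed before such models fit on the mobile devices the abstract refers to.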
We investigate the problem of incorporating higher-level symbolic score-like information into Automatic Music Transcription (AMT) systems to improve their performance. We use recurrent neural networks (RNNs) and their variants as music language models (MLMs) and present a generative architecture for combining these models with predictions from a frame level acoustic classifier. We also compare different...
Due to paper quality and long-term preservation, the ink on one side of a historical document often seeps through and appears on the other side. In this paper, a new blind ink bleed-through removal method is proposed to deal with scanned historical document images. A scanned historical document image generally consists of three components: foreground, bleed-through, and background....
We propose a novel method for analyzing acoustic scenes that can sophisticatedly estimate acoustic scenes from an acoustic event sequence with intermittent missing events. On the basis of the idea that acoustic events are temporally correlated, we model the transition of acoustic events using a hidden Markov model (HMM) and estimate missing acoustic events. Then, we incorporate the transition of acoustic...
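The abstract's core idea, that temporally correlated events let a Markov transition model fill intermittent gaps, can be illustrated with a toy inference: the most likely event at a gap, given its neighbours, maximizes A[prev, x] * A[x, next]. This is a minimal Markov-chain sketch, not the paper's full HMM, and the event labels and transition matrix are invented for illustration:

```python
import numpy as np

# Transition matrix over three hypothetical acoustic events
# (0: speech, 1: door, 2: footsteps); each row sums to 1.
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

def fill_missing(prev_state, next_state):
    """Most likely event at a single gap, given its neighbours:
    P(x_t | x_{t-1}, x_{t+1}) is proportional to
    A[x_{t-1}, x_t] * A[x_t, x_{t+1}]."""
    scores = A[prev_state, :] * A[:, next_state]
    return int(np.argmax(scores))

# An event sequence with one intermittent missing observation (None).
seq = [0, 0, None, 2]
gap = seq.index(None)
seq[gap] = fill_missing(seq[gap - 1], seq[gap + 1])
print(seq)  # [0, 0, 0, 2]
```

With these numbers the gap is filled with event 0, since 0.7 * 0.1 beats the alternatives; a full HMM would additionally marginalize over hidden states and handle runs of consecutive missing events.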
We formulate the problem of detecting the constituent instruments in a polyphonic music piece as a joint decoding problem. From monophonic data, parametric Gaussian Mixture Hidden Markov Models (GM-HMMs) are obtained for each instrument. We propose a method to use the above models in a factorial framework, termed Factorial GM-HMM (F-GM-HMM). The states are jointly inferred to explain the evolution...
Vocoders have recently received renewed attention as basic components in speech synthesis applications such as voice transformation, voice conversion, and statistical parametric speech synthesis. This paper presents a new vocoder synthesizer, referred to as Vocaine, that features a novel Amplitude Modulated-Frequency Modulated (AM-FM) speech model, a new way to synthesize non-stationary sinusoids using...