This paper presents a method to extract structural spectral features from spectral envelopes using what-where autoencoders (WWAE) for statistical parametric speech synthesis (SPSS). A WWAE is constructed by concatenating a convolutional net for input encoding and a deconvolutional net for reconstruction. The output values of the max-pooling layer in the encoder and the positions of the max-pooling...
When using connectionist temporal classification (CTC) based acoustic models (AMs) for large vocabulary continuous speech recognition (LVCSR), most previous studies have used a naive interpolation of the CTC-AM score and an additional language model score, although there is no theoretical justification for such an approach. On the other hand, we recently proposed a theoretically more sound decoding...
Sequence-to-sequence models have shown success in end-to-end speech recognition. However, these models have used only shallow acoustic encoder networks. In our work, we successively train very deep convolutional networks to add more expressive power and better generalization for end-to-end ASR models. We apply network-in-network principles, batch normalization, residual connections and convolutional...
Deep learning has significantly advanced the state of the art in speech recognition in the past few years. However, compared to conventional Gaussian mixture acoustic models, neural network models are usually much larger, and are therefore difficult to deploy in embedded devices. Previously, we investigated a compact highway deep neural network (HDNN) for acoustic modelling, which is a type of depth-gated...
This paper proposes a long short-term memory recurrent neural network (LSTM-RNN) for extracting melody and simultaneously detecting regions of melody from polyphonic audio using the proposed harmonic sum loss. The previous state-of-the-art algorithms have not been based on machine learning techniques and certainly not on deep architectures. The harmonics structure in melody is incorporated in the...
Dropout, the random dropping out of activations according to a specified rate, is a very simple but effective method to avoid over-fitting of deep neural networks to the training data.
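The mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration of "inverted" dropout, not the implementation from any of the listed papers; the function name and the rescaling convention are assumptions for illustration.

```python
import numpy as np

def dropout(activations, rate, rng=None):
    """Randomly zero each activation with probability `rate`.

    Surviving activations are rescaled by 1/(1 - rate) so that the
    expected value of each unit is unchanged; at inference time the
    layer can then simply be skipped. Applied only during training.
    """
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= rate  # True = keep
    return activations * mask / (1.0 - rate)

x = np.ones((4, 3))
y = dropout(x, rate=0.5)
# each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Because a fresh random mask is drawn on every forward pass, the network cannot rely on any single activation, which is what discourages over-fitting.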
In this paper we present an extension of our previously described neural machine translation based system for punctuated transcription. This extension allows the system to map from per frame acoustic features to word level representations by replacing the traditional encoder in the encoder-decoder architecture with a hierarchical encoder. Furthermore, we show that a system combining lexical and acoustic...
Natural language understanding and dialogue policy learning are both essential in conversational systems that predict the next system actions in response to a current user utterance. Conventional approaches aggregate separate models of natural language understanding (NLU) and system action prediction (SAP) as a pipeline that is sensitive to noisy outputs of error-prone NLU. To address the issues,...
Exemplar-based methods for voice conversion often use a large number of randomly-selected exemplars to ensure good coverage. As a result, the factorization step can be costly. This paper presents two algorithms that can be used to construct compact sets of exemplars. The first algorithm uses a forward selection procedure to build the exemplar set sequentially, selecting exemplar pairs that minimize...
Acoustic unit discovery (AUD) is a process of automatically identifying a categorical acoustic unit inventory from speech and producing corresponding acoustic unit tokenizations. AUD provides an important avenue for unsupervised acoustic model training in a zero resource setting where expert-provided linguistic knowledge and transcribed speech are unavailable. Therefore, to further facilitate zero-resource...
It is very important to exploit abundant unlabeled speech to improve acoustic model training in automatic speech recognition (ASR). Semi-supervised training methods incorporate unlabeled data in addition to labeled data to enhance model training, but they suffer from the problem of error-prone labels. The ensemble training scheme trains a set of models and combines them to make the model more...
In recent years, so-called “end-to-end” speech recognition systems have emerged as viable alternatives to traditional ASR frameworks. Keyword search, localizing an orthographic query in a speech corpus, is typically performed by using automatic speech recognition (ASR) to generate an index. Previous work has evaluated the use of end-to-end systems for ASR on well known corpora (WSJ, Switchboard,...
This paper proposes a novel training algorithm for high-quality Deep Neural Network (DNN)-based speech synthesis. The parameters of synthetic speech tend to be over-smoothed, and this causes significant quality degradation in synthetic speech. The proposed algorithm takes into account an Anti-Spoofing Verification (ASV) as an additional constraint in the acoustic model training. The ASV is a discriminator...
Batch normalization is a standard technique for training deep neural networks. In batch normalization, the input of each hidden layer is first mean-variance normalized and then linearly transformed before applying non-linear activation functions. We propose a novel unsupervised speaker adaptation technique for batch normalized acoustic models. The key idea is to adjust the linear transformations previously...
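The batch normalization step described in this abstract, normalize each hidden-layer input over the batch and then apply a learned linear transform, can be illustrated as follows. This is a minimal NumPy sketch of the standard training-time computation, not the adaptation method of the paper; the names `gamma` and `beta` for the linear transform are the conventional ones and are assumed here.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Mean-variance normalize each feature over the batch, then apply
    the learned per-feature linear transform (gamma, beta).

    x: (batch, features). The result is typically passed to a
    non-linear activation. It is this (gamma, beta) transform that a
    speaker-adaptation scheme could adjust per speaker.
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta
```

With `gamma = 1` and `beta = 0` the output of each feature has (approximately) zero mean and unit variance over the batch; adapting these two per-feature parameters changes the scale and shift seen by the following nonlinearity without touching the main weight matrices.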
Subspace methods are used for deep neural network (DNN)-based acoustic model adaptation. These methods first construct a subspace and then perform the speaker adaptation as a point in the subspace. This paper aims to investigate the effectiveness of subspace methods for robust unsupervised adaptation. For the analysis, we compare two state-of-the-art subspace methods, namely, the singular value decomposition...
DNNs have shown remarkable performance in multilingual scenarios; however, these models are often so large that adaptation to a target language with a relatively small amount of data cannot be accomplished well. In our previous work, we utilized Low-Rank Factorization (LRF) using singular value decomposition for multilingual DNNs to learn compact models which can be adapted more successfully...
To advance the performance of continuous emotion recognition from speech, we introduce a reconstruction-error-based (RE-based) learning framework with memory-enhanced Recurrent Neural Networks (RNN). In the framework, two successive RNN models are adopted, where the first model is used as an autoencoder for reconstructing the original features, and the second is employed to perform emotion prediction...
In this paper we aim to enhance keyword search for conversational telephone speech under low-resourced conditions. Two techniques to improve the detection of out-of-vocabulary keywords are assessed in this study: using extra text resources to augment the lexicon and language model, and via subword units for keyword search. Two approaches for data augmentation are explored to extend the limited amount...
In statistical parametric speech synthesis (SPSS), a few studies have investigated the Lombard effect, specifically by using hidden Markov model (HMM)-based systems. Recently, artificial neural networks have demonstrated promising results in SPSS, specifically by using long short-term memory recurrent neural networks (LSTMs). The Lombard effect, however, has not been studied in the LSTM-based speech...
Articulatory information can effectively model variability in speech and can improve speech recognition performance under varying acoustic conditions. Learning speaker-independent articulatory models has always been challenging, as speaker-specific information in the articulatory and acoustic spaces increases the complexity of the speech-to-articulatory space inverse modeling, which is already an...