The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
End-to-end speech recognition systems have been successfully implemented and have become competitive replacements for hybrid systems. A common loss function to train end-to-end systems is connectionist temporal classification (CTC). This method maximizes the log likelihood between the feature sequence and the associated transcription sequence. However there are some weaknesses with CTC training. The...
Recurrent neural networks (RNNs) have shown an ability to model temporal dependencies. However the problem of exploding or vanishing gradients has limited their application. In recent years, long short-term memory RNNs (LSTM RNNs) have been proposed to solve this problem, and have achieved excellent results. However, because of the large size of LSTM RNNs, they more easily suffer from overfitting,...
i-Vector modeling has shown to be effective for text independent speaker verification. It represents each utterance as a low-dimensional vector using factor analysis with a GMM supervector. In order to capture more complex speaker statistics, this paper proposes a new feature representation other than i-vectors for speaker verification using neural networks. In this work, stacked bottleneck features...
The OpenKWS14 keyword search evaluation is one of the most challenging and influential evaluations in the field of speech recognition. Its goal is to build a high-performance keyword search system for a minority language with limited training data in a short period of time. We present the system of the Department of Electronic Engineering, Tsinghua University (THUEE team) for the OpenKWS14 keyword...
Albayzin 2012 language recognition evaluation (LRE) is one of the most challenging language recognition evaluation, which is mainly reflected in: (1) the target languages are more confusable with other languages, which might push down the system performance; (2) developing and test data is heterogeneous regarding duration, number of speakers, ambient noise/music, channel conditions, etc. (3) signals...
The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, is a powerful acoustic modeling technique. Its training process typically involves unsupervised pre-training and supervised fine-tuning. In the paper, we demonstrate that the performance of DNNs can be improved by utilizing a large amount of unlabeled data in the training procedure. In our method, CD-DNN-HMM trained using 309 hours of unlabeled...
Automatic multilingual speech recognition is always a difficult task. This paper presents recent work on the development of a Mandarin-English bilingual speech recognition system. A unified single set of bilingual acoustic models based on a novel State-Time-Alignment (STA) method is proposed to balance the performance and the complexity of the bilingual speech recognition system, and a comparison...
The prevailing system for language recognition is the parallel phoneme recognition followed by vector space modeling (PPRVSM), which uses a vector space model to describe the cooccurrence information of phones. As the super-vectors are composed of phonetic N-Grams, so for high dimension vectors, there is a problem that the number of N-Grams grows exponentially as the order N increases, which will...
This paper not only proposes the detection algorithm of traditional ring signal tones, but also researches the Color Ring Back Tone which is well popular recent years. Different testing methods are given according to the difference of music color ring tone and voice prompt color ring tone. Tested on RMTS (Real-world Multi-channel Telephone Speech) database, experiments show that the detection rates...
In this paper, we describe two approaches for language identification (LID) using support vector machines (SVM) and phonetic n-gram. One is to use the language model scores of phone sequences to do SVM training. The other is to use the n-gram probabilities of those phones to train SVM models. For the second approach, we propose a new effective normalization method. In the experiments of 30 s test...
This paper reports an approach to language identification (LID) system fusion using discriminative training. Maximum mutual information (MMI) training for Gaussian mixture model is introduced to the standard LDA-GMM fusion framework. Experimental results show that the proposed fusion scheme outperforms the maximum likelihood (ML) trained backend of LID system. The impact of number of Gaussian mixtures...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.