When using connectionist temporal classification (CTC) based acoustic models (AMs) for large vocabulary continuous speech recognition (LVCSR), most previous studies have used a naive interpolation of the CTC-AM score and an additional language model score, although there is no theoretical justification for such an approach. On the other hand, we recently proposed a theoretically more sound decoding...
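The naive interpolation mentioned above is typically a log-linear combination of the CTC acoustic score and the LM score. A minimal sketch of that scoring rule follows; the weights and log-probabilities are invented for illustration and are not values from the paper:

```python
def interpolated_score(ctc_logprob, lm_logprob, lm_weight=0.7,
                       word_count=1, word_insertion_penalty=0.0):
    """Naive log-linear interpolation of a CTC acoustic-model score and a
    language-model score, as commonly used in CTC-based LVCSR decoding."""
    return (ctc_logprob
            + lm_weight * lm_logprob
            + word_insertion_penalty * word_count)

# Compare two competing hypotheses for the same audio: hypothesis "a" has a
# slightly worse acoustic score but a much better LM score, so it wins.
hyp_a = interpolated_score(ctc_logprob=-12.0, lm_logprob=-4.0)
hyp_b = interpolated_score(ctc_logprob=-11.5, lm_logprob=-9.0)
best = max([("a", hyp_a), ("b", hyp_b)], key=lambda kv: kv[1])[0]
```

The lack of theoretical justification criticized in the abstract refers to exactly this kind of ad hoc weighting: the CTC score already implicitly contains a label prior, so simply adding a scaled LM score double-counts prior information.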
It is very important to exploit abundant unlabeled speech to improve acoustic model training in automatic speech recognition (ASR). Semi-supervised training methods incorporate unlabeled data in addition to labeled data to enhance model training, but they face the problem of error-prone labels. The ensemble training scheme trains a set of models and combines them to make the model more...
Multilingual (ML) representations play a key role in building speech recognition systems for low-resource languages. The IARPA-sponsored BABEL program focuses on building automatic speech recognition (ASR) and keyword search (KWS) systems in over 24 languages with limited training data. The most common mechanism for deriving ML representations in the BABEL program has been the use of a two-stage network,...
Building an Automatic Speech Recognition (ASR) system requires an acoustic model, a language model, and a dictionary for the intended language; this also applies to Indonesian ASR. In this paper, an Indonesian ASR system was built using the CMUSphinx toolkit (a Hidden Markov Model based ASR tool) with a limited dataset. We use a digit corpus and our own language model, trained on the limited dataset. We also investigated the implementation...
In this paper we study the impact of phonetic annotation precision on the accuracy of a state-of-the-art ASR (automatic speech recognition) system. This issue becomes especially important if we want to port the system to a new language without spending much time collecting, checking, and annotating a large amount of acoustic data in the target language. First, we describe a series of experiments...
In this paper, we investigate methods to improve the recognition performance of low-resource languages with limited training data by borrowing subspace parameters from a high-resource language in subspace Gaussian mixture model (SGMM) framework. As a first step, only the state-specific vectors are updated using low-resource language, while retaining all the globally shared parameters from the high-resource...
In this paper we present our approach to the rapid and efficient development of an automatic speech recognition (ASR) system for Russian. We try to utilize our tools, procedures and data previously designed and collected for other Slavic languages, Czech and Slovak. We show how we build a large corpus of texts acquired from major publishers' web pages and convert it from Cyrillic to Latin to simplify...
This paper presents a deep recurrent regularization neural network (DRRNN) for speech recognition. Our idea is to build a regularized neural network acoustic model by combining Tikhonov and weight-decay regularization, which compensates for variations due to the input speech as well as the model parameters in the restricted Boltzmann machine used as a pre-training stage for feature learning...
Currently, most of the acoustic model selection work is done empirically or heuristically or even arbitrarily. In this paper, Genetic Algorithm (GA) based and Particle Swarm Optimization (PSO) based algorithms that consider the number of states and the kernel numbers for the states simultaneously and reject the uniform allocation of Gaussian kernels are proposed to automatically optimize acoustic...
In this paper, we propose a robust classification strategy for distinguishing between a healthy subject and a patient with pulmonary emphysema on the basis of lung sounds. A symptom of pulmonary emphysema is that almost all lung sounds include some abnormal (i.e., adventitious) sounds. However, the great variety of possible adventitious sounds and noises at auscultation makes high-accuracy detection...
In enclosed environments where robots are deployed, the observed speech signal is smeared due to reverberation. This degrades the performance of automatic speech recognition (ASR). Thus, hands-free speech recognition for human-machine communication is a difficult task. Most speech enhancement techniques used to address this problem enhance the contaminated waveform independently of that of the...
Applications of automatic speech recognition (ASR) have been extended to a variety of tasks and domains, including spontaneous human-human speech. We have developed an ASR system for the Japanese Parliament (Diet), which was deployed this year. By exploiting official records made by human stenographers, we have realized an efficient training scheme for acoustic and language models, which does not require...
Active Learning (AL) is designed to aid the labor-intensive process of training acoustic models for speech recognition. In AL, only the most informative training samples are selected for manual annotation. Thus, how to evaluate the unlabeled samples is worth investigating. In this paper, we propose a unified framework to generate confusion networks at multiple levels, including character, syllable and...
This paper addresses the design and implementation of automatic speaker verification (ASV) systems. There is great interest in developing and increasing the performance of ASV applications, taking into account the advantages offered when compared to other biometrical methods. State-of-the-art speaker recognizers are based on statistical models such as GMM, HMM, SVM, ANN or hybrid models. This work...
This paper introduces a speech recognition system of Mandarin continuous digits based on Sphinx. The acoustic model of this system is produced by SphinxTrain, and the language model is acquired from the Cmuclmtk statistical language model. In addition, this system makes use of PocketSphinx recognition engine as a decoder. According to the experiment, the correct rate of this system to a sentence of...
It has become common practice to adapt acoustic models to specific conditions (gender, accent, bandwidth) in order to improve the performance of speech-to-text (STT) transcription systems. With the growing interest in the use of discriminative features produced by a multilayer perceptron (MLP) in such systems, the question arises of whether it is necessary to specialize the MLP to particular conditions,...
We propose a committee-based active learning method for large vocabulary continuous speech recognition. In this approach, multiple recognizers are prepared beforehand, and the recognition results obtained from them are used for selecting utterances. Here, a progressive search method is used for aligning sentences, and voting entropy is used as a measure for selecting utterances. We apply our method...
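The voting-entropy criterion above can be sketched compactly: each committee member proposes a transcription for an utterance, and the entropy of the vote distribution measures disagreement. The utterance IDs and label strings below are hypothetical, and real systems compare aligned hypotheses rather than whole strings:

```python
import math
from collections import Counter

def voting_entropy(committee_labels):
    """Entropy of the votes a committee of recognizers cast for one
    utterance: higher entropy means more disagreement, hence a more
    informative sample to send for manual annotation."""
    k = len(committee_labels)
    counts = Counter(committee_labels)
    return -sum((v / k) * math.log(v / k) for v in counts.values())

# When all recognizers agree, the entropy is zero.
assert voting_entropy(["hello world"] * 3) == 0.0

# Select the utterance the committee disagrees on most.
utts = {
    "utt1": ["a b c", "a b c", "a b c"],   # full agreement
    "utt2": ["a b c", "a b d", "x y z"],   # total disagreement
}
most_informative = max(utts, key=lambda u: voting_entropy(utts[u]))
```

With three distinct hypotheses, the entropy reaches its maximum of log 3, so `utt2` would be selected for labeling.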
In this paper, we present a novel version of discriminative training for N-gram language models. Language models impose language specific constraints on the acoustic hypothesis and are crucial in discriminating between competing acoustic hypotheses. As reported in the literature, discriminative training of acoustic models has yielded significant improvements in the performance of a speech recognition...
From statistical learning theory, the generalization capability of a model is the ability to generalize well on unseen test data which follow the same distribution as the training data. This paper investigates how generalization capability can also improve robustness when testing and training data are from different distributions in the context of speech recognition. Two discriminative training (DT)...
Automatic generation of punctuation is an essential feature for many speech-to-text transcription tasks. This paper describes a maximum a-posteriori (MAP) approach for inserting punctuation marks into raw word sequences obtained from automatic speech recognition (ASR). The system consists of an "acoustic model" (AM) for prosodic features (actually pause duration) and a "language model" (LM) for...
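The MAP combination of a prosodic score and an LM score can be illustrated with a toy decision rule; the candidate marks and log-scores below are invented for illustration and the actual models in the paper are more elaborate:

```python
def map_punctuation(pause_loglik, lm_logprob, candidates):
    """Pick the punctuation mark (or no mark, "") maximizing the
    a-posteriori score log P(pause | mark) + log P(words, mark) --
    a toy sketch of a MAP combination, not the paper's exact model."""
    return max(candidates, key=lambda m: pause_loglik[m] + lm_logprob[m])

# Hypothetical scores after a word followed by a long pause: the prosodic
# model strongly favors a sentence boundary, the LM mildly disfavors it.
pause_loglik = {"": -3.0, ",": -1.2, ".": -0.8}
lm_logprob   = {"": -0.5, ",": -1.0, ".": -1.1}
best = map_punctuation(pause_loglik, lm_logprob, ["", ",", "."])
```

Here the combined score for "." (-1.9) beats "," (-2.2) and no mark (-3.5), so the long pause tips the MAP decision toward a full stop despite the LM's mild preference for continuing the sentence.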