One of the difficulties in sung speech recognition is the small distance in acoustic space between phonemes in sung speech. We therefore considered clustering the speech based on pitch (fundamental frequency, F0) to create a larger distance between the phonemes. In addition, we considered a two-stage training method for DNN-HMM: the first stage is trained using conventional acoustic features...
This paper investigates the application of unsupervised acoustic unit discovery for topic identification (topic ID) of spoken audio documents. The acoustic unit discovery method is based on a non-parametric Bayesian phone-loop model that segments a speech utterance into phone-like categories. The discovered phone-like (acoustic) units are further fed into the conventional topic ID framework. Using...
Sub-word units such as morphemes are selected as the lexicon for highly inflectional languages, as they provide better coverage and a smaller vocabulary size. However, short units shrink the context available to statistical models, are prone to morpho-phonetic changes, and do not always outperform the word-based model. When sequences of units are merged or split, unit boundaries are phonetically harmonized in the...
In this paper we present our approach to the rapid and efficient development of an automatic speech recognition (ASR) system for Russian. We try to utilize our tools, procedures and data previously designed and collected for other Slavic languages, Czech and Slovak. We show how we build a large corpus of texts acquired from major publishers' web pages and convert it from Cyrillic to Latin to simplify...
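The Cyrillic-to-Latin conversion step mentioned above can be illustrated with a simple character-mapping sketch. The mapping table below is one common romanization scheme chosen for illustration; it is not necessarily the scheme the authors used.

```python
# Minimal Cyrillic-to-Latin transliteration sketch for Russian text.
# The mapping is one common romanization scheme, chosen for illustration only.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh",
    "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch", "ъ": "",
    "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}

def translit(text: str) -> str:
    """Map each Cyrillic character to its Latin equivalent; pass others through."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())

print(translit("привет"))  # -> privet
```

A table-driven converter like this lets text tools built for Latin-script Slavic languages (Czech, Slovak) be reused on the romanized Russian corpus without modification.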
Standard automatic speech recognition (ASR) systems use phoneme-based pronunciation lexicons prepared by linguistic experts. When the hand-crafted pronunciations fail to cover the vocabulary of a new domain, a grapheme-to-phoneme (G2P) converter is used to extract pronunciations for new words, and then a phoneme-based ASR system is trained. G2P converters are typically trained only on the existing lexicons...
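The lexicon-with-fallback setup this abstract describes can be sketched as follows. The toy lexicon entries and the trivial one-letter-per-unit graphemic fallback are illustrative stand-ins; a real system would call a trained G2P converter for out-of-vocabulary words.

```python
# Sketch: consult a hand-crafted lexicon first; for out-of-vocabulary (OOV)
# words, fall back to a graphemic pronunciation (one unit per letter).
# A real system would substitute a trained G2P converter for the fallback.
LEXICON = {  # hypothetical expert-written entries (ARPAbet-style symbols)
    "speech": ["s", "p", "iy", "ch"],
    "model": ["m", "aa", "d", "ah", "l"],
}

def pronounce(word: str) -> list[str]:
    """Return a pronunciation: lexicon entry if present, graphemes otherwise."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    return list(word)  # graphemic fallback for OOV words

print(pronounce("speech"))  # -> ['s', 'p', 'iy', 'ch']
print(pronounce("asr"))     # -> ['a', 's', 'r']
```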
In particular for “low resource” Keyword Search (KWS) and Speech-to-Text (STT) tasks, more untranscribed test data may be available than training data. Several approaches have been proposed to make this data useful during system development, even when initial systems have Word Error Rates (WER) above 70%. In this paper, we present a set of experiments on low-resource languages in telephony speech...
In this paper we describe approaches to building our recent Malay broadcast news audio retrieval system. This system contains speech-to-text and keyword search subsystems. The speech-to-text system is built with two aims: hybrid-vocabulary recognition to tackle the out-of-vocabulary keyword search issue, and diversified acoustic modeling for effective system combination in subsequent keyword searching...
In this paper, we investigate the ability of a recently proposed discriminatively trained, multi-level context-dependent acoustic model to adapt to a new speaker in both supervised and unsupervised adaptation scenarios. Speaker adaptive speech recognition experiments performed on a large-vocabulary spoken lecture task show that the multi-level model reduces word error rates by more than 10% in both...
In prior work, we proposed a method for vocabulary acquisition based on a co-occurrence model and non-negative matrix factorization. The vocabulary is described in terms of co-occurrence statistics of frame-level acoustic descriptions and suffers from poor scalability to larger vocabularies. Much like whole-word HMM models, there is no reuse of sub-word units such as phone models. In this paper,...
A speech recognition system that automatically learns word models for a small vocabulary from examples of its usage, without using prior linguistic information, can be of great use in cognitive robotics, human-machine interfaces, and assistive devices. In the latter case, the user's speech capabilities may also be affected. In this paper, we consider a NMF-based learning framework capable of doing...
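The non-negative matrix factorization at the core of these NMF-based learning frameworks can be sketched in a few lines. The tiny matrix, rank, and plain Lee-Seung multiplicative updates below are illustrative assumptions; the cited work factors co-occurrence statistics of acoustic descriptions, not this toy data.

```python
import random

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def frob_err(V, W, H):
    """Squared Frobenius reconstruction error ||V - WH||^2."""
    R = matmul(W, H)
    return sum((V[i][j] - R[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))

def nmf(V, r, iters=200, eps=1e-9):
    """Factor non-negative V (m x n) as W (m x r) times H (r x n) using
    Lee-Seung multiplicative updates for the Frobenius objective."""
    random.seed(0)  # deterministic init for the sketch
    m, n = len(V), len(V[0])
    W = [[random.random() + 0.1 for _ in range(r)] for _ in range(m)]
    H = [[random.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        WtV = matmul(transpose(W), V)
        WtWH = matmul(matmul(transpose(W), W), H)
        H = [[H[i][j] * WtV[i][j] / (WtWH[i][j] + eps) for j in range(n)]
             for i in range(r)]
        VHt = matmul(V, transpose(H))
        WHHt = matmul(W, matmul(H, transpose(H)))
        W = [[W[i][j] * VHt[i][j] / (WHHt[i][j] + eps) for j in range(r)]
             for i in range(m)]
    return W, H

# Toy co-occurrence-like matrix; its rank-2 structure is recoverable with r=2.
V = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
W, H = nmf(V, r=2)
print(frob_err(V, W, H))  # small reconstruction error after training
```

Because the updates are multiplicative, W and H stay non-negative throughout, which is what makes the learned columns interpretable as additive word (or sub-word) patterns.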
For large-vocabulary continuous speech recognition (LVCSR) of highly-inflected languages, selection of an appropriate recognition unit is the first important step. The morpheme-based approach is often adopted because of its high coverage and linguistic properties. But morpheme units are short, often consisting of one or two phonemes, thus they are more likely to be confused in ASR than word units...
The context-independent deep belief network (DBN) hidden Markov model (HMM) hybrid architecture has recently achieved promising results for phone recognition. In this work, we propose a context-dependent DBN-HMM system that dramatically outperforms strong Gaussian mixture model (GMM)-HMM baselines on a challenging, large vocabulary, spontaneous speech recognition dataset from the Bing mobile voice...
This paper investigates the use of phoneme class conditional probabilities as features (posterior features) for template-based ASR. Using 75-word and 600-word task-independent and speaker-independent setups on the Phonebook database, we investigate the use of different posterior distribution estimators, different distance measures that are better suited for posterior distributions, and different training...
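Distance measures suited to posterior distributions, of the kind compared above, can be sketched concretely. Symmetrised KL divergence and the Bhattacharyya distance are two standard choices for comparing probability vectors; whether these are the exact measures the paper evaluates is an assumption.

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetrised Kullback-Leibler divergence between two posterior vectors."""
    def kl(a, b):
        return sum(x * math.log((x + eps) / (y + eps)) for x, y in zip(a, b))
    return 0.5 * (kl(p, q) + kl(q, p))

def bhattacharyya(p, q):
    """Bhattacharyya distance: negative log of the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(x * y) for x, y in zip(p, q))
    return -math.log(bc)

# Toy posteriors over three phoneme classes (illustrative values).
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
print(sym_kl(p, q))
print(bhattacharyya(p, q))
```

Both measures are zero for identical distributions and grow with disagreement, unlike a plain Euclidean distance, which ignores the probabilistic nature of the features.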
In this paper, the task of selecting the optimal subset of pronunciation variants from a set of automatically generated candidates is recast as a tree search problem. In this approach, the optimal recognition lexicon corresponds with the optimal path through a search tree. We define a discriminative evaluation function to guide the search algorithm, which is based on estimates of the number of recognition...
The large amount of variation present in native speakers' pronunciation of non-native proper names is a big challenge for most automatic speech recognition systems today. The recognizer's ability to handle a variety of different pronunciations is therefore critical to achieve an acceptable recognition performance for this task. This problem has traditionally been solved by including alternative pronunciation...
This paper investigates unsupervised vocabulary and language model self-adaptation (VLA) from just one speech file, using the web as a knowledge source and without prior knowledge of topic or domain beyond optional file metadata. Single-file self-adaptation is regularly used for acoustic adaptation but, to date, is rarely used for VLA. The method investigated here uses a first-pass transcript or file...
This paper describes recent improvements to the Cambridge Arabic Large Vocabulary Continuous Speech Recognition (LVCSR) Speech-to-Text (STT) system. It is shown that Multi-Layer Perceptron (MLP) features trained on phonetic targets can improve the performance of both phonemic and graphemic systems. Also, a morphological decomposition scheme is extended from the graphemic domain to the phonetic domain,...
This paper focuses on a comparison of two continuous-space language modeling techniques, namely Tied-Mixture Language Modeling (TMLM) and Neural Network based Language Modeling (NNLM). Additionally, we report on using alternative feature representations for words and histories used in TMLM. Besides bigram co-occurrence based features, we consider using NNLM-based input features for training TMLMs. We...
We propose a committee-based active learning method for large vocabulary continuous speech recognition. In this approach, multiple recognizers are prepared beforehand, and the recognition results obtained from them are used for selecting utterances. Here, a progressive search method is used for aligning sentences, and voting entropy is used as a measure for selecting utterances. We apply our method...
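The voting-entropy selection criterion described above can be sketched as follows. The toy hypotheses and whole-sentence voting are illustrative simplifications; the abstract applies voting after a progressive-search alignment of the recognizers' outputs.

```python
import math
from collections import Counter

def vote_entropy(hypotheses):
    """Entropy (bits) of the committee's votes for one utterance.
    `hypotheses` holds one recognition result per committee member."""
    k = len(hypotheses)
    counts = Counter(hypotheses)
    return -sum((c / k) * math.log2(c / k) for c in counts.values())

def select_utterances(committee_outputs, n):
    """Pick the n utterances the committee disagrees on most (highest entropy),
    i.e. those most informative to transcribe next."""
    scored = sorted(committee_outputs.items(),
                    key=lambda kv: vote_entropy(kv[1]), reverse=True)
    return [utt_id for utt_id, _ in scored[:n]]

# Toy committee of three recognizers over three utterances.
outputs = {
    "utt1": ["hello world", "hello world", "hello world"],   # full agreement
    "utt2": ["hello word", "hello world", "yellow world"],   # full disagreement
    "utt3": ["good morning", "good morning", "good mourning"],
}
print(select_utterances(outputs, 1))  # -> ['utt2']
```

Utterances where the recognizers all agree score zero entropy and are skipped; the budget for manual transcription goes to the utterances the committee finds hardest.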
Although research has previously been done on multilingual speech recognition, it has been found to be very difficult to improve over separately trained systems. The usual approach has been to use some kind of “universal phone set” that covers multiple languages. We report experiments on a different approach to multilingual speech recognition, in which the phone sets are entirely distinct but the...