Errors in open-domain ASR can be corrected by asking the speaker to rephrase targeted segments of utterances in which errors have been detected. The utterance merging problem consists of generating a better transcript from the utterance in which errors were detected and a clarification utterance. We introduce an alignment-decoding algorithm for jointly processing the two utterances and benefit from the...
Spoken language understanding (SLU) systems use various features to detect the domain, intent and semantic slots of a query. In addition to n-grams, features generated from entity dictionaries are often used in model training. Clean or properly weighted dictionaries are critical to improving a model's coverage and accuracy on entities unseen at test time. However, clean dictionaries are hard to obtain...
This paper presents a Bayesian approach to constructing the recurrent neural network language model (RNN-LM) for speech recognition. Our idea is to regularize the RNN-LM by compensating for the uncertainty of the estimated model parameters, which is represented by a Gaussian prior. The objective function in the Bayesian RNN (BRNN) is formed as the regularized cross-entropy error function. The regularized model...
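For the Bayesian RNN-LM abstract above, a zero-mean isotropic Gaussian prior on the parameters turns maximum a posteriori estimation into cross-entropy training with an L2 penalty. A sketch of the regularized objective, in notation chosen here for illustration (not necessarily the paper's own symbols):

```latex
% MAP objective under an assumed zero-mean isotropic Gaussian prior N(0, sigma^2 I)
E(\theta) = -\sum_{t} \log P(w_t \mid w_{<t}; \theta)
          + \frac{\lambda}{2}\,\lVert \theta \rVert^2,
\qquad \lambda = \frac{1}{\sigma^2}
```

Here the first term is the usual cross-entropy over the word sequence and the second is the penalty induced by the prior: the larger the prior variance σ², the weaker the regularization.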
Neural network based approaches have recently produced record-setting performances in natural language understanding tasks such as word labeling. In the word labeling task, a tagger is used to assign a label to each word in an input sequence. Specifically, simple recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been shown to significantly outperform the previous state-of-the-art...
In this study, we trained a deep autoencoder to build compact representations of short-term spectra of multiple speakers. Using this compact representation as mapping features, we then trained an artificial neural network to predict target voice features from source voice features. Finally, we constructed a deep neural network from the trained deep autoencoder and artificial neural network weights,...
We are interested in the problem of semantics-aware training of language models (LMs) for Automatic Speech Recognition (ASR). Traditional language modeling research has ignored semantic constraints and focused on limited-size word histories. Semantic structures may provide information to capture lexically realized long-range dependencies as well as the linguistic scene of a speech utterance....
Hierarchical phrase-based machine translation [1] (Hiero) is a prominent approach to Statistical Machine Translation, usually comparable to or better than conventional phrase-based systems. However, Hiero typically uses the CKY decoding algorithm, which requires the entire input sentence before decoding begins, as it produces the translation in a bottom-up fashion. Left-to-right (LR) decoding [2] is a promising...
This paper presents initial data collection and language understanding experiments conducted as part of a larger effort to create a nutrition dialogue system that automatically extracts food concepts from a user's spoken meal description. We first summarize the data collection and annotation of food descriptions performed via Amazon Mechanical Turk. We then present semantic labeling experiments using...
Statistical spoken dialogue systems based on Partially Observable Markov Decision Processes (POMDPs) have been shown to be more robust to speech recognition errors by maintaining a belief distribution over multiple dialogue states and making policy decisions based on the entire distribution rather than the single most likely hypothesis. To date most POMDP-based systems have used generative trackers...
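The belief tracking this abstract refers to follows the generic POMDP belief update, b'(s') ∝ P(o | s') Σ_s P(s' | s, a) b(s). A minimal sketch (function and state names here are illustrative, not taken from any cited system):

```python
def belief_update(belief, transition, observation_likelihood, action, observation):
    """One step of POMDP belief tracking.

    belief: dict mapping dialogue state -> probability
    transition(s, a, s_next): P(s_next | s, a)
    observation_likelihood(o, s_next): P(o | s_next)
    Returns the normalized posterior b'(s') over the same states.
    """
    new_belief = {}
    for s_next in belief:
        # Predict: marginalize the transition model over the prior belief.
        predicted = sum(transition(s, action, s_next) * p for s, p in belief.items())
        # Correct: weight by how well s_next explains the observation.
        new_belief[s_next] = observation_likelihood(observation, s_next) * predicted
    z = sum(new_belief.values())
    return {s: p / z for s, p in new_belief.items()}
```

With a uniform initial belief and an observation that favors one state, the update concentrates mass on that state while keeping the full distribution available to the policy, rather than committing to a single hypothesis.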
The decoder is a key component of any modern speech recognizer. Morphologically rich languages pose special challenges for the decoder design, as a very large recognition vocabulary is required to avoid high out-of-vocabulary (OOV) rates. To alleviate these issues, the n-gram models are often trained over subwords instead of words. A subword n-gram model is able to assign probabilities to unseen word...
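To illustrate how a subword n-gram model can assign a probability to an unseen word, the toy bigram below scores any segmentation built from known subword units, using add-one smoothing so that unseen subword transitions still get nonzero probability. The class and its training data are hypothetical, not from the paper:

```python
from collections import defaultdict

class SubwordBigram:
    """Toy subword bigram LM: assigns probability to any word that can be
    segmented into known subword units, even if the word itself was unseen."""

    def __init__(self, vocab_subwords):
        self.subwords = vocab_subwords
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, segmented_words):
        # Each training word is a list of subword units, e.g. ["talo", "ssa"].
        for units in segmented_words:
            seq = ["<w>"] + units + ["</w>"]
            for a, b in zip(seq, seq[1:]):
                self.counts[a][b] += 1

    def prob(self, units):
        # Product of smoothed bigram probabilities over the segmentation.
        seq = ["<w>"] + units + ["</w>"]
        p = 1.0
        for a, b in zip(seq, seq[1:]):
            total = sum(self.counts[a].values())
            p *= (self.counts[a][b] + 1) / (total + len(self.subwords) + 2)
        return p
```

A word whose subword sequence never occurred in training still receives a positive probability, which is exactly what keeps OOV rates manageable for morphologically rich languages.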
We aim to improve term detection performance by augmenting traditional N-gram language models with multiple levels of topic context. We demonstrate that incorporating complementary aspects of topicality leads to significant improvements in term detection accuracy. We represent broad topic context through document-specific latent topics inferred via a Bayesian topic model. We capture local topic context...
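One common way to combine an n-gram model with topic context, consistent with the augmentation idea in this abstract, is linear interpolation of the two probability estimates; the weight `lam` and the callables below are illustrative stand-ins, not the paper's actual formulation:

```python
def interpolated_prob(word, history, p_ngram, p_topic, lam=0.7):
    """Linear interpolation of an n-gram LM with a topic-conditioned model:
    P(w | h) = lam * P_ngram(w | h) + (1 - lam) * P_topic(w)."""
    return lam * p_ngram(word, history) + (1 - lam) * p_topic(word)
```

The topic component can boost on-topic terms that the n-gram history alone would score poorly, which is the mechanism by which topicality can improve term detection.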
Handwriting input methods are particularly useful for languages with a logographic writing system. This paper introduces a multimodal stroke-based predictive input method for the Chinese language. The proposed method requires users to write only the first few strokes of each character, and the system intelligently infers the intended characters by making use of contextual information. Specifically, a statistical...
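A minimal sketch of stroke-prefix prediction as described: filter characters whose stroke sequence begins with the strokes entered so far, then rank the survivors by a contextual language model score. All names and the single-letter stroke codes are invented for illustration:

```python
def rank_candidates(prefix_strokes, context, stroke_index, lm_prob):
    """Return characters whose stroke sequence starts with the entered prefix,
    ranked by a context LM probability (hypothetical helper, for illustration).

    stroke_index: dict mapping character -> full stroke sequence
    lm_prob(char, context): contextual probability of the character
    """
    matches = [c for c, strokes in stroke_index.items()
               if strokes[:len(prefix_strokes)] == prefix_strokes]
    return sorted(matches, key=lambda c: lm_prob(c, context), reverse=True)
```

As the user writes more strokes, the match set shrinks, and the contextual score disambiguates among the remaining candidates.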
Over the past decade, several speech-based electronic assistive technologies (EATs) have been developed that target users with dysarthric speech. These EATs include vocal command-and-control systems, but also voice-input voice-output communication aids (VIVOCAs). In these systems, the vocal interfaces are based on automatic speech recognition (ASR) systems, but this approach requires much training...
Recent works showed the trend of leveraging web-scaled structured semantic knowledge resources such as Freebase for open domain spoken language understanding (SLU). Knowledge graphs provide sufficient but ambiguous relations for the same entity, which can be used as statistical background knowledge to infer possible relations for interpretation of user utterances. This paper proposes an approach to...
This paper addresses the problem of detecting name errors in automatic speech recognition (ASR) output. The highly skewed label distributions (i.e. name errors are infrequent), sparse training data, and large number of potential lexical features pose significant challenges for training name error classification systems. Data-driven feature learning is needed for handling multiple languages but is...
Deficits in semantic and pragmatic expression are among the hallmark linguistic features of autism. Recent work in deriving computational correlates of clinical spoken language measures has demonstrated the utility of automated linguistic analysis for characterizing the language of children with autism. Most of this research, however, has focused either on young children still acquiring language or...
The automatic recognition of disordered speech is a domain that is characterised by limited amounts of training data for each speaker and large intra- and inter-speaker variations. This paper is concerned with how best to train acoustic models in these circumstances; in particular, we look at how to select data for a background model from a pool of speakers for a given target speaker. We show that...
Existing speech classification algorithms often perform well when evaluated on training and test data drawn from the same distribution. In practice, however, these distributions are not always the same. In these circumstances, the performance of trained models will likely decrease. In this paper, we discuss an underutilized divergence measure and derive an estimable upper bound on the test error rate...
We show that it is possible to learn an efficient acoustic model using only a small amount of easily available word-level similarity annotations. In contrast to the detailed phonetic labeling required by classical speech recognition technologies, the only information our method requires is pairs of speech excerpts which are known to be similar (same word) and pairs of speech excerpts which are known...
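Weakly supervised training from same/different word pairs is typically driven by a pairwise objective such as the contrastive loss below; this is a generic sketch, and the paper's actual objective may differ:

```python
def contrastive_loss(dist, same, margin=1.0):
    """Pairwise contrastive loss on the distance between two embeddings:
    pull same-word pairs together, push different-word pairs at least
    `margin` apart (no gradient beyond the margin)."""
    if same:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2
```

Minimizing this loss over many annotated pairs shapes an embedding space in which acoustic distance reflects word identity, without any phonetic labels.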