This paper presents a method for phone-dependent weighting within phonotactic models in automatic language identification. Based on statistical analysis of the phonetic-recognizer behaviour, a phone confidence measure is derived and used to weight the bigram probabilities during testing. The confidence corresponds to the expected decoding stability of individual phones. The proposed method was shown...
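The idea of confidence-weighted bigram scoring can be sketched as follows; the bigram model, the confidence values, and all names here are illustrative placeholders, not the paper's actual statistics (which are derived from recognizer behaviour):

```python
import math

def weighted_bigram_score(phones, bigram_logprob, confidence):
    """Score a decoded phone sequence under a phonotactic bigram model,
    weighting each bigram log-probability by the confidence (expected
    decoding stability) of the current phone -- a minimal sketch."""
    score = 0.0
    for prev, cur in zip(phones, phones[1:]):
        w = confidence.get(cur, 1.0)  # illustrative per-phone weight
        score += w * bigram_logprob[(prev, cur)]
    return score

# Toy two-phone inventory with made-up probabilities and confidences.
bigram_logprob = {("a", "b"): math.log(0.5), ("b", "a"): math.log(0.2)}
confidence = {"a": 0.9, "b": 0.6}
s = weighted_bigram_score(["a", "b", "a"], bigram_logprob, confidence)
```

In a language-identification setup, the utterance would be scored this way against one phonotactic model per language, and the highest-scoring language wins.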
The most successful approach to speech and speaker recognition is to treat the speech signal as a stochastic pattern and to use a statistical pattern recognition technique for matching utterances. This paper studies the performance of a text-dependent speaker verification system using Delta-Delta Mel-Frequency Cepstral Coefficient (MFCC-Δ-Δ) feature vectors and Fuzzy C-means (FCM) speaker...
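For illustration, the standard delta (regression) formula can be written in a few lines of plain Python; applying it twice yields the delta-delta (acceleration) features of MFCC-Δ-Δ. The one-dimensional toy track below is illustrative, not data from the paper:

```python
def delta(frames, N=2):
    """First-order delta features via the standard regression formula
    over +/-N neighbouring frames, with edge frames replicated.
    `frames` is a list of per-frame feature vectors (lists of floats)."""
    T, dim = len(frames), len(frames[0])
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        vec = []
        for d in range(dim):
            num = sum(n * (frames[min(t + n, T - 1)][d]
                           - frames[max(t - n, 0)][d])
                      for n in range(1, N + 1))
            vec.append(num / denom)
        out.append(vec)
    return out

# Delta-delta is simply delta() applied twice to the static MFCCs.
mfcc = [[0.0], [1.0], [2.0], [3.0]]  # toy 1-D "MFCC" track
d1 = delta(mfcc)   # velocity
d2 = delta(d1)     # acceleration (the Δ-Δ term)
```

The static, delta, and delta-delta vectors are then concatenated per frame before clustering or verification.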
Monaural singing voice separation has attracted considerable attention. Many pitch-based methods have been proposed to address this task, but they generally have limited performance. The most crucial difficulties lie in inaccurate estimation of voiced pitches and failure to recognize unvoiced singing sounds. In this paper, we propose a novel algorithm based on the latent component analysis of time-frequency...
Current benchmark speech-based depression detection techniques rely on acoustic speech parameters extracted from large sets of representative speech recordings. This study is the first to investigate depression detection based on higher-order influence model (HOIM) coefficients and emotional transition parameters derived from a relatively small set of conversational speech recordings representing...
Speaker identification (SID) in cochannel speech, where two speakers are talking simultaneously over a single recording channel, is a challenging problem. Previous studies address this problem in anechoic environments under the Gaussian mixture model (GMM) framework; cochannel SID in reverberant conditions, however, has not been addressed. This paper studies cochannel SID in both anechoic...
For automatic speech recognition (ASR) of lectures, texts of presentation slides are expected to be useful for adapting a language model, while slide texts are not always available in a machine-readable form. In this paper, we propose a language model adaptation framework that uses character recognition results of slide images in a lecture video. Since character recognition results contain many errors,...
We propose a novel approach for addressing automatic speech recognition (ASR) and natural language understanding (NLU) errors in an interactive spoken dialog system using targeted clarification (TC). TC applies when a spoken utterance is partially recognized by focusing a clarification question on the misrecognized part of the utterance. A key component of TC is accurate detection of localized ASR...
Indoor localization of multiple speech sources in wireless acoustic sensor networks (WASNs) is an open and interesting problem with many practical applications, but the presence of noise and reverberation complicates the problem. In this paper, a distributed algorithm for multiple DOA estimation of speech sources in WASNs is presented. The method exploits the sparsity of speech sources in the time-frequency...
A group of junior and senior researchers gathered as a part of the 2014 Frederick Jelinek Memorial Workshop in Prague to address the problem of predicting the accuracy of a nonlinear Deep Neural Network probability estimator for unknown data in a different application domain from the domain in which the estimator was trained. The paper describes the problem and summarizes approaches that were taken...
We propose the prediction-adaptation-correction RNN (PAC-RNN), in which a correction DNN estimates the state posterior probability based on both the current frame and the prediction made on past frames by a prediction DNN. The result from the correction DNN is fed back to the prediction DNN to make better predictions for future frames. In the PAC-RNN, we can consider that, given the new, current...
In spatial audio analysis-synthesis, one of the key issues is to decompose a signal into primary and ambient components based on their spatial features. Principal component analysis (PCA) has been widely employed in primary component extraction, and shifted PCA (SPCA) is employed to enhance the primary extraction for input signals involving the inter-channel time difference. However, SPCA generally...
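Primary component extraction via PCA can be sketched for a two-channel signal: the principal eigenvector of the 2x2 channel covariance gives the dominant (primary) direction, and the orthogonal residual is treated as ambience. The signals below are toy values; a real system operates per time-frequency tile, and this sketch does not include the time-difference shift that SPCA adds:

```python
import math

def pca_primary_ambient(left, right):
    """Split a stereo pair into primary (dominant-direction projection)
    and ambient (orthogonal residual) parts via PCA of the 2x2
    channel covariance -- a minimal sketch of primary extraction."""
    n = len(left)
    ml, mr = sum(left) / n, sum(right) / n
    cll = sum((l - ml) ** 2 for l in left) / n
    crr = sum((r - mr) ** 2 for r in right) / n
    clr = sum((l - ml) * (r - mr) for l, r in zip(left, right)) / n
    # Largest eigenvalue of [[cll, clr], [clr, crr]] in closed form.
    lam = 0.5 * (cll + crr + math.sqrt((cll - crr) ** 2 + 4 * clr ** 2))
    vx, vy = clr, lam - cll            # unnormalized principal eigenvector
    norm = math.hypot(vx, vy) or 1.0
    ux, uy = vx / norm, vy / norm
    primary = [l * ux + r * uy for l, r in zip(left, right)]
    ambient = [-l * uy + r * ux for l, r in zip(left, right)]
    return (ux, uy), primary, ambient

# Right channel is a scaled copy of the left: fully correlated input,
# so the ambient residual should vanish.
u, primary, ambient = pca_primary_ambient([1.0, 2.0, 3.0, 4.0],
                                          [2.0, 4.0, 6.0, 8.0])
```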
In this paper, we propose a low-rank tensor deconvolution problem which seeks multiway replicative patterns and corresponding activating tensors of rank-1. An alternating least squares (ALS) algorithm has been derived for the model to sequentially update loading components and the patterns. In addition, together with a good initialisation method using tensor diagonalization, the update rules have...
Traditional sound event recognition methods based on informative front end features such as MFCC, with back end sequencing methods such as HMM, tend to perform poorly in the presence of interfering acoustic noise. Since noise corruption may be unavoidable in practical situations, it is important to develop more robust features and classifiers. Recent advances in this field use powerful machine learning...
Main-stream speech codecs are based on modelling the speech source by a linear predictor. An efficient domain for quantization and coding of this linear predictor is the line spectral frequency representation, where the predictor is encoded into an ordered set of frequencies that correspond to the roots of the corresponding line spectral polynomials. While this representation is robust in terms of...
This paper investigates the effect of voice quality in conversational speech. Voice quality is often considered the characteristic auditory colouring of an individual speaker's voice, but in our study, we find that voice quality can also reveal information about the interlocutor in everyday social interactions. In the correlation analysis between acoustic measures and interlocutors, the effect caused...
This paper deals with speaker clustering for speech corrupted by noise. In general, the performance of speaker clustering depends heavily on how well the similarities between speech utterances can be measured. The recently proposed i-vector-based cosine similarity has yielded state-of-the-art performance in speaker clustering systems. However, this similarity often fails to capture...
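The cosine similarity used for i-vector scoring is just the normalized dot product of two utterance vectors; a minimal sketch with toy vectors (real i-vectors typically have several hundred dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two i-vectors (plain lists of
    floats), the standard score in i-vector speaker clustering."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy 3-dimensional "i-vectors".
score = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```

Clustering then proceeds by, e.g., agglomerative merging of the utterance pairs with the highest scores.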
The robustness of speech recognizers towards noise can be increased by normalizing the statistical moments of the Mel-frequency cepstral coefficients (MFCCs), e.g. by using cepstral mean normalization (CMN) or cepstral mean and variance normalization (CMVN). The necessary statistics are estimated over a long time window, often a complete utterance. Consequently, changes in the background...
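Utterance-level CMVN itself is straightforward: each cepstral dimension is shifted to zero mean and scaled to unit variance over the utterance. A minimal sketch with a toy two-frame, two-dimensional "utterance":

```python
import math

def cmvn(frames):
    """Cepstral mean and variance normalization: per-dimension
    zero-mean, unit-variance scaling over one utterance."""
    T, dim = len(frames), len(frames[0])
    out = [[0.0] * dim for _ in range(T)]
    for d in range(dim):
        col = [f[d] for f in frames]
        mean = sum(col) / T
        var = sum((x - mean) ** 2 for x in col) / T
        std = math.sqrt(var) or 1.0   # guard constant dimensions
        for t in range(T):
            out[t][d] = (frames[t][d] - mean) / std
    return out

norm = cmvn([[1.0, 10.0], [3.0, 30.0]])
```

Because the statistics are pooled over the whole window, a mid-utterance change in the background noise is averaged into a single mean and variance, which is exactly the weakness the abstract points at.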
In this paper we explore one of the key aspects in building an emotion recognition system: generating suitable feature representations. We generate feature representations from both acoustic and lexical levels. At the acoustic level, we first extract low-level features such as intensity, F0, jitter, shimmer and spectral contours. We then generate different acoustic feature representations based...
For many years, filterbanks have been widely used as a frontend feature-extraction step for Automatic Speech Recognition (ASR). In this paper, we propose a unified framework for ASR frontends by first moving the nonlinear amplitude scaling ahead of the filterbank, and then combining the filterbank weights with the cosine basis vectors. As part of this framework, we also show that the delta terms used to encode feature...
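The "combining" step rests on a simple fact: two linear maps compose into a single matrix once the nonlinearity is no longer between them. A toy sketch with made-up matrices (not the paper's actual filterbank or basis), where `x` stands for a spectrum to which the amplitude scaling has already been applied:

```python
def matmul(A, B):
    """Plain-Python matrix product (A is m x k, B is k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def matvec(A, x):
    """Plain-Python matrix-vector product."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

M = [[0.5, 0.5, 0.0],    # toy "filterbank" weights
     [0.0, 0.5, 0.5]]
C = [[1.0, 1.0],         # toy "cosine basis" rows
     [1.0, -1.0]]
x = [2.0, 4.0, 6.0]      # spectrum after the nonlinear amplitude scaling

two_step = matvec(C, matvec(M, x))   # filterbank, then basis projection
one_step = matvec(matmul(C, M), x)   # single combined transform C*M
```

With the nonlinearity in its conventional place (between `M` and `C`), no such single matrix exists, which is why the scaling has to move first.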
Recurrent neural networks (RNNs) have recently been applied as the classifiers for sequential labeling problems. In this paper, deep bidirectional RNNs (DBRNNs) are applied for the first time to error detection in automatic speech recognition (ASR), which is a sequential labeling problem. We investigate three types of ASR error detection tasks, i.e. confidence estimation, out-of-vocabulary word detection...