Proxy-word-based out-of-vocabulary (OOV) keyword search has proven quite effective. In this approach, each OOV keyword is assigned several proxies, and detections of the proxies are regarded as detections of the OOV keyword. However, the confidence scores of these detections are still those of the proxies from the lattices. To obtain a better confidence...
Computational auditory scene analysis (CASA) systems have been widely used for speech enhancement in recent years. We propose a new system that combines CASA and spectral subtraction to obtain better enhanced speech. The CASA part is based on deep neural networks (DNNs). The original way to reconstruct the denoised signal is to use the estimated masks with the direct overlap-add method, ignoring...
This paper proposes a speech/music classification system based on i-vectors. Two classification methods, cosine distance score (CDS) and support vector machine (SVM), are analyzed. Two session compensation methods, within-class covariance normalization (WCCN) and linear discriminant analysis (LDA), are also discussed. The proposed systems yield better results compared...
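Cosine distance scoring, as named in the abstract above, can be illustrated with a minimal sketch. It assumes each class is represented by a single mean i-vector; the function names, class labels, and toy vectors are hypothetical and not taken from the paper.

```python
import math


def cosine_distance_score(w_test, w_target):
    """Cosine similarity between two i-vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(w_test, w_target))
    norm_test = math.sqrt(sum(a * a for a in w_test))
    norm_target = math.sqrt(sum(b * b for b in w_target))
    return dot / (norm_test * norm_target)


def classify(w_test, class_means):
    """Return the label whose (assumed) mean i-vector scores highest."""
    scores = {
        label: cosine_distance_score(w_test, mean)
        for label, mean in class_means.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy 3-dimensional "i-vectors"; real i-vectors have hundreds of dimensions.
class_means = {"speech": [1.0, 0.0, 0.0], "music": [0.0, 1.0, 0.0]}
label, scores = classify([0.9, 0.1, 0.0], class_means)
```

In practice the i-vectors would first pass through a session compensation step such as WCCN or LDA; that step is omitted here.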
This paper develops a system to automatically distinguish natural speech from synthetic speech. The issue of feature selection is considered. We take the commonly used Mel-Frequency Cepstrum Coefficient (MFCC) feature into consideration, as well as other features such as Relative Phase Shift (RPS) and pitch tuned for Automatic Speech Recognition (ASR). We found some features are complementary in the...
Word posterior probability has been widely used as a confidence estimate in automatic speech recognition (ASR) systems and has proved quite effective in related applications such as keyword search. However, word posterior probability tends to overestimate the true probability of a hypothesis, as it is computed on a subset of the total hypothesis space. In this paper, we show that a...
Speech separation based on deep neural networks (DNNs) has been widely studied recently, and has achieved considerable success. However, previous studies are mostly based on fully-connected neural networks. In order to capture the local information of speech signals, we propose to use convolutional maxout neural networks (CMNNs) to separate speech and noise by estimating the ideal ratio mask of the...
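For readers unfamiliar with the ideal ratio mask mentioned in the abstract above, the following is a minimal sketch of one common definition: the per-unit ratio of speech power to speech-plus-noise power, raised to an exponent. The exponent `beta=0.5` and the function name are assumptions, and the paper may define the mask differently. In a separation system a network is trained to predict this mask; the sketch only computes the training target from known speech and noise.

```python
def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Compute an ideal ratio mask per time-frequency unit.

    speech_power and noise_power are equally shaped lists of rows
    (frames), each row holding non-negative power values per frequency bin.
    """
    mask = []
    for speech_row, noise_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(speech_row, noise_row):
            total = s + n
            # Units with no energy at all get a mask value of 0.
            row.append((s / total) ** beta if total > 0 else 0.0)
        mask.append(row)
    return mask


# Two frames, two frequency bins of toy power values.
irm = ideal_ratio_mask([[1.0, 0.0], [9.0, 4.0]], [[1.0, 2.0], [16.0, 0.0]])
```

Mask values lie in [0, 1]; multiplying the noisy magnitude spectrogram by the mask attenuates noise-dominated units.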
i-Vector modeling has been shown to be effective for text-independent speaker verification. It represents each utterance as a low-dimensional vector using factor analysis with a GMM supervector. In order to capture more complex speaker statistics, this paper proposes a new feature representation other than i-vectors for speaker verification using neural networks. In this work, stacked bottleneck features...
The OpenKWS14 keyword search evaluation is one of the most challenging and influential evaluations in the field of speech recognition. Its goal is to build a high-performance keyword search system for a minority language with limited training data in a short period of time. We present the system of the Department of Electronic Engineering, Tsinghua University (THUEE team) for the OpenKWS14 keyword...
The Albayzin 2012 language recognition evaluation (LRE) is one of the most challenging language recognition evaluations, mainly because: (1) the target languages are more confusable with other languages, which might push down system performance; (2) the development and test data are heterogeneous regarding duration, number of speakers, ambient noise/music, channel conditions, etc.; (3) signals...
In this paper, we propose a method to improve detection of the mispronunciation types of non-native learners. To cope with the low-resource condition of non-native speech and the difference between native and non-native speech, the following efforts are made: 1) train the acoustic model with the low-resource non-native data; 2) introduce the articulatory-based tandem feature; 3) pool auxiliary native...
The Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, is a powerful acoustic modeling technique. Its training process typically involves unsupervised pre-training and supervised fine-tuning. In this paper, we demonstrate that the performance of DNNs can be improved by utilizing a large amount of unlabeled data in the training procedure. In our method, CD-DNN-HMM trained using 309 hours of unlabeled...
In the field of prosody event detection, many local acoustic features have been proposed to represent the prosodic characteristics of speech units. The context information that represents possible regularities underlying neighboring prosody events, however, has not been used effectively. The main difficulty in utilizing prosodic context is that it is hard to capture the long-distance sequential dependency...
This paper presents a method to improve mispronunciation detection performance for low-resource acoustic models. One hour of speech data is randomly selected from CU-CHLOE to imitate the low-resource non-native English situation. The tandem feature derived from an articulatory-based Multi-Layer Perceptron (MLP) is employed to replace the traditional spectral feature (e.g. PLP). Further, motivated by similar...
Audio indexing has been an important part of the NIST-RT-SD evaluation since 2003. Speaker diarization is one kind of audio indexing technology, in which the audio is marked by speaker. One essential component of speaker diarization is speaker clustering, which serves as a pre-processing step for speech recognition. The general method is to extract acoustic features such as LPCC or MFCC and train a model such as an HMM...
The shifted delta cepstrum (SDC) is a widely used feature extraction method for language recognition (LRE). With a high context width due to the incorporation of multiple frames, SDC outperforms traditional delta and acceleration feature vectors. However, it also introduces correlation into the concatenated feature vector, which increases redundancy and may degrade the performance of backend classifiers. In...
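To make the multi-frame stacking in the abstract above concrete, here is a minimal sketch of the standard SDC computation: for each frame, `k` delta vectors taken at shifts of `p` frames, each spanning `±d` frames, are concatenated. The defaults `d=1, p=3, k=7` follow the widely used 7-1-3-7 configuration and are an assumption rather than the paper's setting; clamping indices at the utterance edges is also an illustrative choice.

```python
def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Compute SDC features from a list of cepstral frames.

    cepstra is a list of frames, each a list of N cepstral coefficients.
    Returns one k*N-dimensional vector per input frame.
    """
    num_frames = len(cepstra)

    def frame(index):
        # Clamp out-of-range indices to the first/last frame (edge padding).
        return cepstra[min(max(index, 0), num_frames - 1)]

    features = []
    for t in range(num_frames):
        stacked = []
        for i in range(k):
            ahead = frame(t + i * p + d)
            behind = frame(t + i * p - d)
            # Delta block i: c(t + i*p + d) - c(t + i*p - d)
            stacked.extend(a - b for a, b in zip(ahead, behind))
        features.append(stacked)
    return features


# Toy input: 30 frames with 2 coefficients that grow linearly over time.
toy_cepstra = [[float(t), 2.0 * t] for t in range(30)]
sdc = shifted_delta_cepstra(toy_cepstra)
```

Because adjacent delta blocks are computed from nearby, overlapping stretches of the same signal, the stacked vector contains correlated entries, which is the redundancy the abstract refers to.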
Combining different features has been proved to be a good method for improving performance in speech recognition. In speaker recognition (SRE), various features have also been developed to reflect complementary aspects of speaker characteristics. This paper proposes an effective multi-feature combination for speaker recognition. In order to avoid the “dimensionality disaster” and to delimit...
Double talk detection is used in acoustic echo cancellation systems to keep the adaptive filter from diverging. This paper describes a new real-time double talk detection algorithm. A voice activity detection algorithm is used to detect the endpoints of each speech segment. The algorithm then uses a logic unit to detect double talk in the dialogue. The new algorithm presented in this paper is robust against...
The Mel-frequency cepstrum coefficient (MFCC) is a widely used feature vector in speech signal processing. Its feature extraction procedure can be seen as a mapping function that transforms input speech signals into output MFCC feature vectors. However, this function is too complex to analyze, and even a simple approximation is not easy to obtain. This paper studies the effects of each MFCC feature extraction...
The MVDR beamformer is a robust beamforming method for enhancing a desired (speech) signal in the presence of stationary noise. This paper presents a modified subband post-filtering approach for the MVDR beamformer in a microphone array system. The quality of the modified subband post-filtering is studied in simulated rooms with different noise levels and is compared to the Wiener post-filtering proposed in the literature...
This paper explores the use of constrained maximum likelihood linear regression (CMLLR) transforms as features for language recognition. Modeling is carried out through support vector machine (SVM). This work proposes a novel CMLLR supervector kernel. Results on the NIST LRE09 task show that feature-domain CMLLR transforms contain more language dependent information than model-domain MLLRs, and the...