Speech transients have been shown to be important cues for identifying and discriminating speech sounds. We previously described a wavelet packet-based method for extracting transient speech (Rasetshwane et al. WASPAA 2007, pp. 179-182). The algorithm uses a "transitivity function" to characterize the rate of change of wavelet coefficients, and it can be implemented in real-time to process...
This paper presents a new approach to speech enhancement based on a distributed microphone network. Each microphone is used to simultaneously classify the input as one of the noise types or as speech. To enhance the speech signal, a modified spectral subtraction approach is used that utilizes the sound information of the entire network to update the noise model even during speech. This improves...
We present an approach to model-based voice activity detection (VAD) for harsh environments. By using mel-frequency cepstral coefficient (MFCC) features extracted from clean and noisy speech samples, an artificial neural network is trained to provide a reliable model. There are three main aspects to this study: first, in addition to the developed model, recent state-of-the-art VAD methods...
Transcription of music is the process of generating a symbolic representation such as a score sheet or a MIDI file from an audio recording of a piece of music. A statistical machine learning approach for detecting note onsets in polyphonic piano music is presented. An area from the spectrogram of the sound is concatenated into one feature vector. A cascade of boosted classifiers is used for dimensionality...
A blind approach for estimating the signal to noise ratio (SNR) of a speech signal corrupted by additive noise is proposed. The method is based on a pattern recognition paradigm using various linear predictive based features, a neural network classifier and estimation combination. Blind SNR estimation is very useful in speaker identification systems in which a confidence metric is determined along...
We consider the problem of word boundary detection in spontaneous speech utterances. Acoustic features have been well explored in the literature in the context of word boundary detection; however, in spontaneous speech from the Switchboard-I corpus, we found that the accuracy of word boundary detection using acoustic features is poor (F-score ~ 0.63). We propose a new feature that captures lexical cues...
Monaural speech segregation in reverberant environments is a very difficult problem. We develop a supervised learning approach by proposing an objective function that directly relates to the computational goal of maximizing signal-to-noise ratio. The model trained using this new objective function yields significantly better results for time-frequency unit labeling. In our segregation system, a segmentation...
The Mel-frequency cepstral coefficient (MFCC) is the most widely used feature in speech and speaker recognition. However, the traditional MFCC is very sensitive to noise interference, which tends to drastically degrade the performance of recognition systems because of mismatches between training and testing conditions. In this paper, we propose a new speaker recognition algorithm based on the dynamic MFCC parameters...
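The dynamic MFCC parameters mentioned above are typically delta coefficients computed by regression over neighboring frames. As an illustrative sketch (assuming the standard delta regression formula; the paper's exact formulation and window size are not given in this excerpt):

```python
import numpy as np

def delta(features, N=2):
    """Compute delta (dynamic) coefficients from a (frames x coeffs)
    feature matrix using the standard regression formula:
        d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2)
    Edge frames are handled by repeating the first/last frame."""
    T = len(features)
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        deltas += n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
    return deltas / denom

# Toy 5-frame, 3-coefficient "MFCC" matrix increasing by 3 per frame:
# interior deltas recover the per-frame slope of 3.
mfcc = np.arange(15, dtype=float).reshape(5, 3)
d = delta(mfcc)
```

Delta-delta (acceleration) features are obtained by applying the same operation to the deltas.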
Unvoiced speech poses a big challenge to current monaural speech segregation systems. It lacks harmonic structure and is highly susceptible to interference due to its relatively weak energy. This paper describes a new approach to segregate unvoiced speech from nonspeech interference. The system first estimates a voiced binary mask, and then performs unvoiced speech segregation in two stages: segmentation...
The paper analyzes the short-term auto-correlation property of speech signals and confirms it through detailed comparative experiments with other kinds of signals. By applying the auto-correlation properties of the current speech frame and nearby frames, a new feature for voice activity detection called weighted short-term summation of auto-correlation (WSAC) is formed. Experiments show that the new VAD feature...
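The idea of a weighted summation of short-term autocorrelation can be sketched in a few lines of numpy. This is only an illustration of the principle (the exact WSAC weighting and multi-frame combination in the paper are not reproduced; the decaying lag weights below are a hypothetical choice):

```python
import numpy as np

def frame_autocorr_feature(frame, max_lag=40):
    """Weighted summation of normalized short-term autocorrelation
    over lags 1..max_lag. Voiced speech, being quasi-periodic, yields
    larger values than white-like noise. Illustrative sketch only."""
    frame = frame - np.mean(frame)
    r0 = np.dot(frame, frame)
    if r0 == 0:
        return 0.0
    feats = []
    for lag in range(1, max_lag + 1):
        r = np.dot(frame[:-lag], frame[lag:]) / r0  # normalized autocorrelation
        feats.append(abs(r))
    weights = np.linspace(1.0, 0.5, max_lag)  # hypothetical decaying lag weights
    return float(np.dot(weights, feats))

# A periodic "voiced" frame scores much higher than white noise.
rng = np.random.default_rng(0)
t = np.arange(400)
voiced = np.sin(2 * np.pi * t / 50)   # strongly periodic
noise = rng.standard_normal(400)      # aperiodic
score_voiced = frame_autocorr_feature(voiced)
score_noise = frame_autocorr_feature(noise)
```

Thresholding such a score per frame gives a simple periodicity-based VAD decision.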
In traditional VAD algorithms, high-order statistics (HOS) are usually applied in the time domain and limited to the white-noise case. In this paper, a spectral-domain HOS feature called spectral kurtosis is introduced, on the basis of which the differing characteristics of speech and noise in the spectral domain are explored. By introducing a "time delay" and double thresholds...
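A minimal sketch of spectral kurtosis, assuming the common definition as the kurtosis of the STFT magnitude in each frequency bin across frames (the paper's time-delay and double-threshold scheme is not reproduced here):

```python
import numpy as np

def spectral_kurtosis(signal, frame_len=256, hop=128):
    """Per-bin excess kurtosis of STFT magnitudes across frames.
    Stationary Gaussian-like noise gives a low, flat value per bin,
    while intermittent speech energy gives high peaks."""
    n_frames = (len(signal) - frame_len) // hop + 1
    window = np.hanning(frame_len)
    mags = np.array([
        np.abs(np.fft.rfft(window * signal[i * hop:i * hop + frame_len]))
        for i in range(n_frames)
    ])                                        # shape: (frames, bins)
    mu = mags.mean(axis=0)
    sigma = mags.std(axis=0) + 1e-12
    return np.mean(((mags - mu) / sigma) ** 4, axis=0) - 3.0  # excess kurtosis

# Example: low-level noise with a brief strong tone at FFT bin 32;
# the bin seeing the intermittent tone shows much higher kurtosis.
rng = np.random.default_rng(1)
sig = 0.1 * rng.standard_normal(10240)
n = np.arange(1024)
sig[:1024] += 5.0 * np.sin(2 * np.pi * 32 * n / 256)
sk = spectral_kurtosis(sig)
```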
In this paper, a new feature selection method for speaker recognition is proposed to keep high-quality speech frames for speaker modelling and to remove noisy and corrupted speech frames. In order to obtain robust voice activity detection in a variety of acoustic conditions, the spectral subtraction algorithm is adopted to estimate the frame power. An energy-based frame selection algorithm is then...
State-of-the-art automatic speech recognition systems typically adopt the feature set containing mel-frequency cepstral coefficients (MFCC) and their time derivatives. The noise vulnerability of MFCC significantly degrades the recognition performance of such systems in noisy conditions. This paper describes a noise-robust feature extraction method. A set of new MFCC features is derived from the dynamic...
Studies have shown that, depending on speaker task and environmental conditions, recognizers are sensitive to noisy, stressful environments. The focus of this study is to achieve robust recognition in diverse environmental conditions by extracting robust features. Central to the technique is the root cepstrum coefficient (RCC) method, used instead of the logarithm amplitude spectrum and discrete cosine transform...
Mel-frequency cepstral coefficients (MFCC) are the most widely used features for speech recognition. However, MFCC-based speech recognition performance degrades in presence of additive noise. In this paper, we propose a set of noise-robust features based on conventional MFCC feature extraction method. Our proposed method consists of two steps. In the first step, mel sub-band Wiener filtering is carried...
A robust speech feature extraction method based on the power law of hearing and a non-uniform spectral compression technique is proposed, and the corresponding model compensation algorithm is given. The mismatch functions, reflecting the effects of additive noise and spectral compression, and the model compensation formulae are derived. Experimental results show that a significant improvement...
Most current audio-visual automatic speech recognition (AV-ASR) systems use static weights to balance audio and visual information during information fusion. State-of-the-art research has led to using audio reliability metrics to dynamically change the fusion weights, successfully improving overall recognition results. So far, however, incorporating visual reliability metrics...
Existing voice activity detectors (VADs) typically depend on specific audio codecs and show degraded performance in the presence of music signals. This paper presents a sound activity detection method independent of audio codecs. An entropy feature set with adaptive noise estimation updates is proposed to improve the performance of entropy in detecting both speech and music. Afterwards, a...
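The basic spectral-entropy feature underlying such detectors can be sketched as follows (the adaptive noise-estimation update proposed in the paper is not shown): the magnitude spectrum is normalized into a probability mass function, so flat broadband noise yields high entropy while speech or music energy concentrated in a few bins yields low entropy.

```python
import numpy as np

def spectral_entropy(frame):
    """Spectral entropy of one frame: normalize the power spectrum
    to a probability mass function and take its Shannon entropy.
    Flat (noise-like) spectra score high; peaky (tonal) spectra low."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    p = spec / (np.sum(spec) + 1e-12)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# A pure tone concentrates all power in one bin (entropy near zero),
# while white noise spreads power across all bins (high entropy).
t = np.arange(512)
tone = np.sin(2 * np.pi * 16 * t / 512)
rng = np.random.default_rng(0)
white = rng.standard_normal(512)
```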
One of the most recent models for voice conversion is the classical LPC analysis-synthesis model combined with GMM, which aims to separate excitation from vocal-tract information and to learn the transformation rules with statistical methods. However, it does not work as well as expected due to the inaccuracy of the extracted feature information as well as the overly smoothed spectral...
The Mel-frequency cepstral coefficients (MFCC) are widely used for speech recognition. However, MFCC-based speech recognition performance degrades in presence of additive noise. In this paper, we propose a set of noise-robust features based on conventional MFCC feature extraction method. Our proposed method consists of two steps. In the first step, Mel sub-band spectral subtraction is carried out...
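Conventional magnitude spectral subtraction, the building block behind the mel sub-band variant described above, can be sketched on a single frame (the paper applies subtraction per mel sub-band before MFCC extraction; that variant, and the noise estimator, are not reproduced here; the oracle noise spectrum below is for illustration only):

```python
import numpy as np

def spectral_subtraction(noisy, noise_est, floor=0.02):
    """Minimal single-frame magnitude spectral subtraction.
    noisy:     time-domain frame of noisy speech
    noise_est: estimated noise magnitude spectrum (e.g. averaged over
               known noise-only frames)
    The noisy phase is reused; a spectral floor avoids negative
    magnitudes (musical-noise suppression heuristic)."""
    spec = np.fft.rfft(noisy)
    mag = np.abs(spec)
    phase = np.angle(spec)
    clean_mag = np.maximum(mag - noise_est, floor * mag)  # subtract with floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))

# Illustration with a known noise realization as the "estimate".
rng = np.random.default_rng(0)
t = np.arange(512)
clean = np.sin(2 * np.pi * 8 * t / 512)
noise = 0.3 * rng.standard_normal(512)
noisy = clean + noise
noise_mag = np.abs(np.fft.rfft(noise))   # oracle noise spectrum, illustration only
enhanced = spectral_subtraction(noisy, noise_mag)
```

Real systems estimate `noise_est` from speech-absent frames (e.g. via a VAD) rather than from an oracle.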