Recently, a deep beamforming (BF) network was proposed to predict BF weights from phase-carrying features, such as generalized cross correlation (GCC). The BF network is trained jointly with the acoustic model to minimize the automatic speech recognition (ASR) cost function. In this paper, we propose to replace GCC with features derived from input signals' spatial covariance matrices (SCM), which contain...
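As a rough illustration of the quantity this abstract refers to, the sketch below estimates one spatial covariance matrix per frequency bin from a multichannel STFT by averaging outer products over time. The function name `spatial_covariance` and the array layout (channels, frames, frequency bins) are assumptions made for this example. The abstract is truncated and does not say how the paper turns SCMs into network features, so that step is not shown.

```python
import numpy as np

def spatial_covariance(stft):
    """Estimate per-frequency spatial covariance matrices.

    stft: complex array of shape (channels, frames, freq_bins) (assumed layout).
    Returns an array of shape (freq_bins, channels, channels).
    """
    stft = np.asarray(stft)
    n_frames = stft.shape[1]
    # For each frequency bin f: R(f) = (1/T) * sum_t x(t, f) x(t, f)^H
    return np.einsum("ctf,dtf->fcd", stft, stft.conj()) / n_frames
```

Each returned matrix is Hermitian, so its off-diagonal entries carry the inter-channel phase differences while the diagonal holds per-channel power.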
I-vector adaptation of DNN-HMM acoustic models has shown clear performance improvements for speech recognition. In this paper, we study this technique on the Babel task. We use Swahili as the target language (50 hours of training data) and another 6 languages as multilingual resources to train i-vector extractors respectively. Our study shows that i-vector extractors trained with more multilingual data only...
Spoofing speech detection aims to differentiate spoofing speech from natural speech. Frame-based features are used in most previous work. Although multiple frames or dynamic features are used to form a super-vector to represent the temporal information, the time span covered by these features is not sufficient. Most of the systems failed to detect the non-vocoder or unit selection based...
Linear discriminant analysis (LDA) and Gaussian probabilistic LDA (PLDA) have been shown to effectively suppress channel- and session-variability of i-vectors. However, they suffer from the following limitations: 1) in LDA, a single linear transformation may not be adequate to describe the nonlinear relationship of features, and 2) Gaussian-PLDA assumes the speaker and channel factors follow a Gaussian distribution,...
In speech processing, the speech signal is usually processed frame by frame due to the non-stationary characteristic of speech. In this paper, a frequency-domain averaging based frame smoothing method is proposed. Besides the conventional frame shift, we introduce a short time shift to create several frames around the current frame. Then we average the power spectra of these frames. The average...
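The idea described in this abstract can be sketched as follows: for each conventional frame, take a few extra frames offset by a short shift on either side, and average their power spectra. This is only an illustrative sketch, not the paper's implementation. The function name `smoothed_power_spectra`, the Hamming window, the zero-padding at the signal edges, and all default parameter values are assumptions for this example; the abstract is truncated and does not specify them.

```python
import numpy as np

def smoothed_power_spectra(signal, frame_len=400, frame_shift=160,
                           short_shift=20, n_side=2):
    """Average the power spectrum over frames offset by short shifts.

    All parameter values are hypothetical defaults (in samples).
    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hamming(frame_len)
    # Offsets of the extra frames around each conventional frame position.
    offsets = np.arange(-n_side, n_side + 1) * short_shift
    pad = n_side * short_shift
    # Zero-pad both ends so shifted frames never run outside the signal.
    padded = np.pad(signal, (pad, pad), mode="constant")
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        start = pad + i * frame_shift
        power = [
            np.abs(np.fft.rfft(window * padded[start + o:start + o + frame_len])) ** 2
            for o in offsets
        ]
        spectra[i] = np.mean(power, axis=0)
    return spectra
```

With `n_side=0` the function reduces to an ordinary framed power spectrum, which makes it easy to compare smoothed and unsmoothed features on the same signal.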
In this paper we report our approaches to the very limited resource keyword search (KWS) task in the NIST Open Keyword Search 2015 (OpenKWS15) Evaluation. We devised methods, first, to attain better acoustic modeling: multilingual and semi-supervised acoustic model training as well as exemplar-based acoustic model training; second, to address the overwhelming out-of-vocabulary...
Synthetic speech refers to speech signals generated by text-to-speech (TTS) and voice conversion (VC) techniques. They pose a threat to speaker verification (SV) systems, as an attacker may use TTS or VC to synthesize a speaker's voice to cheat the SV system. To address this challenge, we study the detection of synthetic speech using long-term magnitude and phase information of speech. As most of...
The speaker verification (SV) task has been an active area of research for the last thirty years. One recent research topic is improving the robustness of SV systems in challenging environments. This paper examines the robustness of a current state-of-the-art SV system against background noise corruption. Specifically, we consider the scenario where the SV system is trained from noise-free...
This paper presents a deep neural network-conditional random field (DNN-CRF) system with multi-view features for sentence unit detection on English broadcast news. We proposed a set of multi-view features extracted from the acoustic, articulatory, and linguistic domains, and used them together in the DNN-CRF model to predict the sentence boundaries. We tested the accuracy of the multi-view features...
This letter presents a new feature normalization technique to normalize the temporal structure of speech features. The temporal structure of the features is partially represented by its power spectral density (PSD). We observed that the PSD of the features varies with the corrupting noise and signal-to-noise ratio. To reduce the PSD variation due to noise, we propose to normalize the PSD of features...