On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition

Xiong Xiao; Shengkui Zhao; Douglas L. Jones; Eng Siong Chng; Haizhou Li

doi:10.1109/ICASSP.2017.7952756

Source

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3246-3250

Abstract

Acoustic beamforming plays a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated both individually and in combination on the CHiME-4 challenge. The results show that each method improves ASR performance on its own and that the methods are complementary. The combined system achieves a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.
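
To make the link between TF masks, SCMs, and MVDR weights concrete, the following is a minimal NumPy sketch of mask-based MVDR beamforming. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes a commonly used reference-microphone form of the MVDR solution, w(f) = Φ_n(f)⁻¹ Φ_s(f) u / tr(Φ_n(f)⁻¹ Φ_s(f)), where u selects the reference channel. The function names (`masked_scm`, `mvdr_weights`, `apply_beamformer`), the array layout (frequency, time, microphone), and the diagonal-loading constant are all illustrative assumptions. The masks are taken as given, for example from a separately trained RNN.

```python
import numpy as np


def masked_scm(Y, mask, eps=1e-10):
    """Mask-weighted spatial covariance matrix per frequency bin.

    Y:    complex STFT, shape (F, T, M) = (frequency, time, microphone).
    mask: real-valued TF mask in [0, 1], shape (F, T).
    Returns an array of shape (F, M, M).
    """
    weighted = mask[..., None] * Y
    # Sum over time of mask(f, t) * y(f, t) y(f, t)^H.
    scm = np.einsum("ftm,ftn->fmn", weighted, Y.conj())
    # Normalize by the total mask weight in each frequency bin.
    return scm / (mask.sum(axis=1)[:, None, None] + eps)


def mvdr_weights(phi_s, phi_n, ref_mic=0, diag_load=1e-6):
    """MVDR weights from speech and noise SCMs (assumed formulation).

    phi_s, phi_n: shape (F, M, M). Returns weights of shape (F, M).
    """
    M = phi_n.shape[-1]
    # Diagonal loading (an assumption of this sketch) keeps phi_n invertible.
    load = diag_load * np.trace(phi_n, axis1=1, axis2=2).real / M
    phi_n = phi_n + load[:, None, None] * np.eye(M)
    # Solve phi_n X = phi_s for every frequency bin at once.
    numerator = np.linalg.solve(phi_n, phi_s)
    trace = np.trace(numerator, axis1=1, axis2=2)
    # Pick the column of the reference microphone and normalize by the trace.
    return numerator[:, :, ref_mic] / (trace[:, None] + 1e-10)


def apply_beamformer(w, Y):
    """Apply the weights: out(f, t) = w(f)^H y(f, t). Returns shape (F, T)."""
    return np.einsum("fm,ftm->ft", w.conj(), Y)
```

In this sketch, a speech mask and a noise mask would each be passed to `masked_scm` together with the multichannel STFT, and the two resulting SCMs would then be passed to `mvdr_weights`. Estimating the two masks independently, as the abstract describes, means the noise mask need not equal one minus the speech mask.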