A structure-preserving training target for supervised speech separation

Yuxuan Wang; DeLiang Wang

doi:10.1109/ICASSP.2014.6854777

Source

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6107-6111

Abstract

Supervised-learning-based speech separation has shown considerable success recently. In its simplest form, a discriminative model is trained as a time-frequency masking function whose training target is an ideal mask. Ideal masks, such as the ideal binary mask, are structured spectro-temporal patterns. However, previous formulations do not model this prominent output structure. In this paper, we propose an alternative training target that is explicitly related to mask structure. We first learn a compositional model of the square-root ideal ratio mask, which is closely related to the Wiener filter. Instead of directly estimating the ideal mask values, we learn to predict the weights of the resulting mask-level spectro-temporal bases, which are then used to generate the estimated masks. In other words, the discriminative model predicts the parameters of a generative model of the target of interest. Experimental results show consistent improvements in low-SNR conditions when the new training target is adopted.
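The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say which compositional model is used, so the example below assumes plain non-negative matrix factorization (NMF) with multiplicative updates as a stand-in. It also assumes that each column of the mask matrix is a vectorized spectro-temporal patch, and it reuses the weights found by NMF in place of the output of a trained discriminative model. The function names (sqrt_irm, learn_bases, masks_from_weights) and the toy random data are hypothetical and chosen only for illustration. The square-root ideal ratio mask is taken to be sqrt(S / (S + N)), where S and N are the speech and noise powers in a time-frequency unit; its square, S / (S + N), has the form of the Wiener gain.

```python
import numpy as np


def sqrt_irm(speech_power, noise_power, eps=1e-12):
    """Square-root ideal ratio mask, sqrt(S / (S + N)), per time-frequency unit."""
    return np.sqrt(speech_power / (speech_power + noise_power + eps))


def learn_bases(masks, n_bases, n_iter=200, seed=0, eps=1e-12):
    """Assumed compositional model: NMF with multiplicative updates (Euclidean cost).

    masks has shape (n_features, n_examples) and must be non-negative.
    Returns bases (n_features, n_bases) and weights (n_bases, n_examples)
    such that masks is approximated by bases @ weights.
    """
    rng = np.random.default_rng(seed)
    n_features, n_examples = masks.shape
    bases = rng.random((n_features, n_bases)) + eps
    weights = rng.random((n_bases, n_examples)) + eps
    for _ in range(n_iter):
        # Update the weights with the bases held fixed, then the reverse.
        weights *= (bases.T @ masks) / (bases.T @ bases @ weights + eps)
        bases *= (masks @ weights.T) / (bases @ weights @ weights.T + eps)
    return bases, weights


def masks_from_weights(bases, weights):
    """Generate estimated masks from (predicted) weights, clipped to [0, 1]."""
    return np.clip(bases @ weights, 0.0, 1.0)


# Toy data standing in for speech and noise power spectrograms (an assumption).
rng = np.random.default_rng(0)
speech_power = rng.random((64, 500)) ** 2
noise_power = rng.random((64, 500)) ** 2

target = sqrt_irm(speech_power, noise_power)      # ideal mask to be modeled
bases, weights = learn_bases(target, n_bases=16)  # mask-level bases
estimate = masks_from_weights(bases, weights)     # mask generated from weights
```

In a setup closer to the one the abstract describes, the weights would be the regression target of the discriminative model, which would predict them from noisy-speech features at test time; masks_from_weights would then turn the predicted weights into an estimated mask.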