Recently, we proposed an unsupervised learning model for auditory filterbanks based on a convolutional restricted Boltzmann machine (RBM) with rectified linear units. In this paper, we present the theory and training algorithm of the proposed model, along with a detailed analysis of the learned filterbank. Training the model on different databases shows that it learns cochlear-like impulse responses that are localized in the frequency domain. An auditory-like scale obtained from filterbanks learned on clean and noisy datasets resembles the Mel scale, which is known to mimic perceptually relevant aspects of speech. We experimented with both cepstral features (denoted ConvRBM-CC) and filterbank features (denoted ConvRBM-BANK). On a large-vocabulary continuous speech recognition task, we achieved a relative improvement of 7.21–17.8% in word error rate (WER) over Mel frequency cepstral coefficient (MFCC) features and 1.35–6.82% over Mel filterbank (FBANK) features. On the AURORA 4 multicondition training database, a relative improvement in WER of 4.8–13.65% was achieved using a hybrid Deep Neural Network-Hidden Markov Model (DNN-HMM) system with ConvRBM-CC features. Using ConvRBM-BANK features, we achieved an absolute reduction of 1.25–3.85% in WER on the AURORA 4 test sets compared to FBANK features. A context-dependent DNN-HMM system further improves performance, with a relative improvement of 3.6–4.6% on average for the bigram 5k and trigram 5k language models. Hence, our proposed learned filterbank performs better than traditional MFCC and Mel-filterbank features for both clean and multicondition automatic speech recognition (ASR) tasks. A system combination of ConvRBM-BANK and FBANK features further improves performance in all ASR tasks. Cross-domain experiments, in which subband filters trained on one database are used for the ASR task of another database, show that the model learns generalized representations of speech signals.
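To make the core idea concrete, the following is a minimal NumPy sketch of learning 1-D subband filters from a raw signal with a convolutional RBM and rectified linear hidden units, trained by contrastive divergence (CD-1). It is illustrative only: the filter count, filter length, learning rate, and the deterministic (mean-field) ReLU in place of noisy ReLU sampling are all simplifying assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small illustrative sizes (hypothetical, not the paper's settings).
n_filters, filter_len, sig_len = 8, 16, 256

W = 0.01 * rng.standard_normal((n_filters, filter_len))  # subband filters
b = np.zeros(n_filters)                                  # hidden biases

def relu(x):
    return np.maximum(x, 0.0)

def hidden_activations(v, W, b):
    # Correlate the raw signal with each subband filter (valid positions only),
    # then apply the rectified linear nonlinearity.
    acts = np.array([np.correlate(v, w, mode="valid") for w in W])
    return relu(acts + b[:, None])

def reconstruct(h, W):
    # Transposed convolution: superimpose each filter weighted by its activations.
    v = np.zeros(sig_len)
    for k in range(len(W)):
        for t in range(h.shape[1]):
            v[t:t + filter_len] += h[k, t] * W[k]
    return v

def cd1_step(v, W, b, lr=1e-4):
    # One contrastive-divergence (CD-1) update; mean-field ReLU units are used
    # here instead of sampling, for simplicity.
    h0 = hidden_activations(v, W, b)   # positive (data) phase
    v1 = reconstruct(h0, W)            # reconstruction
    h1 = hidden_activations(v1, W, b)  # negative (model) phase
    for k in range(len(W)):
        pos = np.correlate(v, h0[k], mode="valid")   # data statistics
        neg = np.correlate(v1, h1[k], mode="valid")  # model statistics
        W[k] += lr * (pos - neg)
        b[k] += lr * (h0[k].mean() - h1[k].mean())
    return W, b

# Train on a toy harmonic signal standing in for a raw speech frame.
t = np.arange(sig_len)
signal = np.sin(2 * np.pi * 0.05 * t) + 0.5 * np.sin(2 * np.pi * 0.12 * t)
for _ in range(50):
    W, b = cd1_step(signal, W, b)
```

In a real setup, many speech frames would be used in place of the single toy signal, and the learned rows of `W` would be the impulse responses whose center frequencies trace out the auditory-like scale discussed above.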