Automatic acoustic segmentation in N-best list rescoring for lecture speech recognition

Peng Shen; Xugang Lu; Hisashi Kawai

doi:10.1109/ISCSLP.2016.7918409

Source

2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 1-5

Abstract

Speech segmentation is important in automatic speech recognition (ASR) and machine translation (MT). In N-best list rescoring in particular, generating N-best lists that contain as many candidates as possible from a decoding lattice requires proper utterance segmentation. In lecture speech recognition, only long audio recordings are provided, without any utterance segmentation information. In addition to speech, the recordings contain other acoustic events, e.g., laughter and applause. Traditional speech segmentation algorithms for ASR rely on acoustic cues, whereas text segmentation algorithms for MT rely mainly on linguistic cues. In this study, we propose a three-stage speech segmentation framework that integrates both acoustic and linguistic cues. We tested the segmentation framework on lecture speech recognition, and the results showed the effectiveness of the proposed segmentation algorithm.