Real-time, low-latency online inference and decoding in sequential probabilistic models are important in many interactive systems, including automatic speech recognition (ASR) and other streaming applications. We study total inference latency (TL) in such systems, defined as the sum of two components: the context-window latency (CWL) arising from the inherent look-ahead of a deep neural network's (DNN) input context window in a DNN-HMM hybrid system, and the model-smoothing latency (MSL) incurred by Kalman-style smoothing in a dynamic probabilistic model (hence, TL = CWL + MSL). For a fixed TL, the best accuracy can occur with a strictly positive MSL, often substantially so, a surprising result given the representational power of the DNN. Furthermore, we find that accuracy is often improved with a smaller TL and a larger MSL, i.e., when the latency budget shifts away from the DNN's context window toward model smoothing. These results suggest that for optimal low-latency real-time decoding, the size of the DNN context window and the degree of model smoothing should be chosen jointly.
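
To make the latency accounting concrete, the following is a minimal sketch of the TL = CWL + MSL decomposition. The 10 ms frame period, the function name `total_latency_ms`, and the specific frame counts are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of the additive latency decomposition TL = CWL + MSL.
# FRAME_MS is an assumed acoustic frame period (10 ms is common in ASR
# front ends, but the paper does not specify this value).

FRAME_MS = 10  # assumed frame period in milliseconds

def total_latency_ms(right_context_frames: int, smoothing_lag_frames: int) -> int:
    """Total inference latency: DNN look-ahead plus fixed-lag smoothing delay."""
    cwl = right_context_frames * FRAME_MS  # context-window latency (CWL)
    msl = smoothing_lag_frames * FRAME_MS  # model-smoothing latency (MSL)
    return cwl + msl

# Two ways to spend the same 100 ms budget: all of it on the DNN's
# look-ahead, or part of it reallocated to Kalman-style smoothing.
print(total_latency_ms(right_context_frames=10, smoothing_lag_frames=0))  # 100
print(total_latency_ms(right_context_frames=4, smoothing_lag_frames=6))   # 100
```

The two calls above spend an identical latency budget in different ways; the abstract's claim is that, for a fixed budget, the second style of allocation (strictly positive MSL) can yield better accuracy.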