Speaker diarization using deep neural network embeddings

Daniel Garcia-Romero; David Snyder; Gregory Sell; Daniel Povey; Alan McCree

doi:10.1109/ICASSP.2017.7953094

Source

2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4930-4934

Abstract

Speaker diarization is an important front-end for many speech technologies in the presence of multiple speakers, but current methods that employ i-vector clustering for short segments of speech are potentially too cumbersome and costly for the front-end role. In this work, we propose an alternative approach for learning representations via deep neural networks to remove the i-vector extraction process from the pipeline entirely. The proposed architecture simultaneously learns a fixed-dimensional embedding for acoustic segments of variable length and a scoring function for measuring the likelihood that the segments originated from the same or different speakers. Through tests on the CALLHOME conversational telephone speech corpus, we demonstrate that, in addition to streamlining the diarization architecture, the proposed system matches or exceeds the performance of state-of-the-art baselines. We also show that, though this approach does not respond as well to unsupervised calibration strategies as previous systems, the incorporation of well-founded speaker priors sufficiently mitigates this shortcoming.
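
To make the pipeline described in the abstract concrete, the following is a minimal sketch of how per-segment embeddings and a pairwise scoring function could drive diarization through clustering. It is not the authors' implementation: cosine similarity stands in for the learned scoring function, average-linkage agglomerative clustering is an assumed clustering choice, and the names `cosine_score`, `cluster_segments`, and `threshold` are hypothetical. In practice the embeddings would come from the trained network; here they are plain lists of floats.

```python
import math


def cosine_score(a, b):
    """Stand-in similarity score (assumption).

    The paper learns its scoring function jointly with the embedding;
    cosine similarity is used here only to keep the sketch self-contained.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_segments(embeddings, score=cosine_score, threshold=0.5):
    """Group segments by speaker with average-linkage agglomerative clustering.

    embeddings: one fixed-dimensional vector per speech segment.
    threshold:  hypothetical stopping point; merging stops once the best
                average pairwise score between two clusters falls below it.
    Returns one integer speaker label per segment.
    """
    n = len(embeddings)
    # Pairwise score matrix between all segments.
    scores = [[score(embeddings[i], embeddings[j]) for j in range(n)]
              for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per segment

    while len(clusters) > 1:
        best, best_pair = -math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(scores[i][j]
                            for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if avg > best:
                    best, best_pair = avg, (a, b)
        if best < threshold:
            break  # remaining clusters are treated as different speakers
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

For example, `cluster_segments([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])` groups the first two segments under one speaker label and the last two under another. The stopping threshold plays the role that calibration plays in a real system: it decides how many speakers are ultimately reported.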