On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
Conference proceeding   Open access

Yiling Huang, Weiran Wang, Guanlong Zhao, Hank Liao, Wei Xia and Quan Wang
INTERSPEECH 2024, pp.32-36
Interspeech
01/01/2024
DOI: 10.21437/Interspeech.2024-561
URL: https://doi.org/10.21437/Interspeech.2024-561
Published (Version of record) Open Access

Abstract

While standard speaker diarization attempts to answer the question "who spoke when", many realistic applications are interested in determining "who spoke what". In both the conventional modularized approach and the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate speakers with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same architecture by sharing blank logits. Such a framework makes it easy to add diarization capabilities to any existing RNN-T-based ASR model without Word Error Rate (WER) regressions. Experimental results demonstrate that WEEND outperforms a strong turn-based diarization baseline system on all 2-speaker short-form scenarios, with the capability to generalize to audio lengths of 5 minutes.
Subjects: Computer Science; Technology; Computer Science, Artificial Intelligence; Science & Technology
