Conference proceeding
On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
INTERSPEECH 2024, pp.32-36
Interspeech
01/01/2024
DOI: 10.21437/Interspeech.2024-561
Abstract
While standard speaker diarization attempts to answer the question "who spoke when", many realistic applications are interested in determining "who spoke what". In both the conventional modularized approach and the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate speakers with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same architecture by sharing blank logits. Such a framework allows easily adding diarization capabilities to any existing RNN-T based ASR models without Word Error Rate (WER) regressions. Experimental results demonstrate that WEEND outperforms a strong turn-based diarization baseline system on all 2-speaker short-form scenarios, with the capability to generalize to audio lengths of 5 minutes.
Details
- Title: On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
- Creators: Yiling Huang - Google (United States); Weiran Wang - Google (United States); Guanlong Zhao - Google (United States); Hank Liao - Google (United States); Wei Xia - Google (United States); Quan Wang - Google (United States)
- Resource Type: Conference proceeding
- Publication Details: INTERSPEECH 2024, pp.32-36
- Series: Interspeech
- DOI: 10.21437/Interspeech.2024-561
- ISSN: 2308-457X
- Publisher: Isca-Int Speech Communication Assoc
- Number of pages: 5
- Language: English
- Date published: 01/01/2024
- Academic Unit: Computer Science
- Record Identifier: 9984798231302771