On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization

Yiling Huang; Weiran Wang; Guanlong Zhao; Hank Liao; Wei Xia; Quan Wang

doi:10.21437/Interspeech.2024-561

Conference proceeding

Open access

INTERSPEECH 2024, pp.32-36

Interspeech

01/01/2024

Files and links (1)

https://doi.org/10.21437/Interspeech.2024-561

Published (Version of record) Open Access

Abstract

While standard speaker diarization attempts to answer the question "who spoke when", many real-world applications need to determine "who spoke what". In both the conventional modularized approach and the more recent end-to-end neural diarization (EEND), an additional automatic speech recognition (ASR) model and an orchestration algorithm are required to associate speakers with recognized words. In this paper, we propose Word-level End-to-End Neural Diarization (WEEND) with an auxiliary network, a multi-task learning algorithm that performs end-to-end ASR and speaker diarization in the same architecture by sharing blank logits. Such a framework makes it easy to add diarization capabilities to any existing RNN-T based ASR model without Word Error Rate (WER) regressions. Experimental results demonstrate that WEEND outperforms a strong turn-based diarization baseline system on all 2-speaker short-form scenarios, with the capability to generalize to audio lengths of 5 minutes.
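The abstract does not spell out how the blank logits are shared, so the following is only a minimal sketch of one plausible reading: at each lattice node, the diarization output reuses the blank logit produced by the ASR joint network and appends the speaker logits produced by the auxiliary network. The function names, the blank index, and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def share_blank(asr_logits, speaker_logits, blank_index=0):
    """Build diarization logits for one lattice node.

    Assumption: the auxiliary network predicts only speaker logits and
    borrows the blank logit from the ASR joint network unchanged.
    """
    return [asr_logits[blank_index]] + list(speaker_logits)


# Hypothetical logits for a single (time, label) lattice node.
asr_logits = [2.0, 0.5, -1.0, 0.1]  # [blank, token_a, token_b, token_c]
speaker_logits = [1.2, -0.3]        # [speaker_1, speaker_2]

diar_logits = share_blank(asr_logits, speaker_logits)
asr_probs = softmax(asr_logits)
diar_probs = softmax(diar_logits)

# Most likely speaker for a word emitted at this node (blank excluded).
best_speaker = max(range(len(speaker_logits)), key=lambda i: speaker_logits[i])
```

In this sketch only the blank logit is shared; each head is normalized over its own label set, so the blank probabilities of the two heads generally differ. Because the ASR logits are read but never modified, the recognizer's outputs stay as they were, which is in keeping with the abstract's claim of no WER regressions.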

Computer Science

Technology

Computer Science, Artificial Intelligence

Science & Technology

Details

Title: On the Success and Limitations of Auxiliary Network Based Word-Level End-to-End Neural Speaker Diarization
Creators: Yiling Huang - Google (United States)
Weiran Wang - Google (United States)
Guanlong Zhao - Google (United States)
Hank Liao - Google (United States)
Wei Xia - Google (United States)
Quan Wang - Google (United States)
Resource Type: Conference proceeding
Publication Details: INTERSPEECH 2024, pp.32-36
Series: Interspeech
DOI: 10.21437/Interspeech.2024-561
ISSN: 2308-457X
Publisher: International Speech Communication Association (ISCA)
Number of pages: 5
Language: English
Date published: 01/01/2024
Academic Unit: Computer Science
Record Identifier: 9984798231302771
