Multi-Output RNN-T Joint Networks for Multi-Task Learning of ASR and Auxiliary Tasks

Weiran Wang; Ding Zhao; Shaojin Ding; Hao Zhang; Shuo-Yiin Chang; David Rybach; Tara N. Sainath; Yanzhang He; Ian McGraw; Shankar Kumar

doi:10.1109/ICASSP49357.2023.10096273

Conference proceeding

Open access

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5

06/04/2023

Files and links (1)

url

https://doi.org/10.1109/ICASSP49357.2023.10096273

Published (Version of record) Open Access

Abstract

We propose a multi-output joint network architecture for the RNN transducer (RNN-T), for multi-task modeling of ASR and auxiliary tasks that rely on ASR outputs. Each output of the joint network predicts target labels from its own task-specific vocabulary, disjoint from those of the other tasks, while all outputs share the same audio features from the encoder and the same language model features from the prediction network. Each task is trained with an RNN-T loss that marginalizes over all possible paths, and we allow multiple tasks to share the blank logit so that they are synchronized. We demonstrate our method on two auxiliary tasks, namely capitalization and pause prediction, and discuss different considerations for modeling and inference procedures. For capitalization, we successfully distill capitalization labels from a standalone text normalization model, and achieve competitive Uppercase Error Rate (UER) while offering streaming capability and improved inference efficiency. In addition, our model achieves capitalization accuracy similar to that of a mixed-case ASR model, but obtains improved WERs when integrated with external language models. For pause prediction, we achieve the same performance as the previous two-step approach while providing a simpler training recipe without affecting ASR accuracy.
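The abstract describes two output heads that share encoder and prediction network features and a single blank logit. The following is a minimal sketch of one possible reading of that idea, not the authors' implementation. The class name `MultiOutputJoint`, the additive `tanh` combination of features, the random weights, and the choice to normalize each head separately over the shared blank logit plus its own labels are all assumptions made for illustration; the record does not specify them.

```python
import math
import random


def linear(weights, bias, x):
    """Return W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def init_linear(rng, out_dim, in_dim):
    """Small random weights; a stand-in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class MultiOutputJoint:
    """Hypothetical joint network with an ASR head and one auxiliary head."""

    def __init__(self, feat_dim, asr_vocab, aux_vocab, seed=0):
        rng = random.Random(seed)
        self.blank = init_linear(rng, 1, feat_dim)        # one shared blank logit
        self.asr = init_linear(rng, asr_vocab, feat_dim)  # e.g. wordpieces
        self.aux = init_linear(rng, aux_vocab, feat_dim)  # e.g. capitalization tags

    def __call__(self, enc_t, pred_u):
        # Assumed combination: elementwise sum of encoder and prediction
        # network features, followed by tanh.
        hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
        blank_logit = linear(*self.blank, hidden)  # list of length 1
        # Index 0 of each head is the shared blank logit; the remaining
        # entries are that head's own (disjoint) label logits.
        asr_logits = blank_logit + linear(*self.asr, hidden)
        aux_logits = blank_logit + linear(*self.aux, hidden)
        return log_softmax(asr_logits), log_softmax(aux_logits)


# Usage: one (time frame, label position) pair with 4-dimensional features.
joint = MultiOutputJoint(feat_dim=4, asr_vocab=5, aux_vocab=2)
asr_lp, aux_lp = joint([0.1, -0.2, 0.3, 0.0], [0.05, 0.1, -0.1, 0.2])
```

In this sketch each head would feed its own RNN-T loss over the same (time, label) lattice. The blank logit is shared before normalization, so the blank log-probabilities of the two heads generally differ after the per-head softmax.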

capitalization

End-to-end ASR

joint network

Multitasking

Network architecture

pause prediction

Predictive models

RNN-Transducer

Signal processing

Training

Transducers

Vocabulary

Details

Title: Multi-Output RNN-T Joint Networks for Multi-Task Learning of ASR and Auxiliary Tasks
Creators: Weiran Wang - Google (United States)
Ding Zhao - Google (United States)
Shaojin Ding - Google (United States)
Hao Zhang - Google (United States)
Shuo-Yiin Chang - Google (United States)
David Rybach - Google (United States)
Tara N. Sainath - Google (United States)
Yanzhang He - Google (United States)
Ian McGraw - Google (United States)
Shankar Kumar - Google (United States)
Resource Type: Conference proceeding
Publication Details: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.1-5
DOI: 10.1109/ICASSP49357.2023.10096273
ISSN: 1520-6149
eISSN: 2379-190X
Publisher: IEEE
Language: English
Date published: 06/04/2023
Academic Unit: Computer Science
Record Identifier: 9984696725602771
