JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

Zhong Meng; Weiran Wang; Rohit Prabhavalkar; Tara N. Sainath; Tongzhou Chen; Ehsan Variani; Yu Zhang; Bo Li; Andrew Rosenberg; Bhuvana Ramabhadran

doi:10.1109/ICASSP49357.2023.10095249

Conference proceeding

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol.2023-, pp.1-5

06/04/2023

Abstract

We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method that injects large-scale unpaired text into the ILM during E2E training, which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of the E2E and ILM losses. During JEIT, the ILM absorbs knowledge from unpaired text while the E2E training serves as regularization. Unlike ILM adaptation methods, JEIT does not require a separate adaptation step and avoids the need for Kullback-Leibler divergence regularization of the ILM. We also show that the modular hybrid autoregressive transducer (MHAT) performs better than HAT in the JEIT framework and is much more robust than HAT during ILM adaptation. To push the limit of unpaired text injection, we further propose combined JEIT and JOIST training (CJJT), which benefits from modality matching, encoder text injection, and ILM training. Both JEIT and CJJT can foster more effective LM fusion. With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4% over a model trained without unpaired text.
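
The weighted-sum objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the single scalar weight `ilm_weight`, and the use of precomputed per-token probabilities are all assumptions made for illustration, since the abstract states only that a weighted sum of an E2E loss and an ILM cross-entropy loss is minimized.

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-probability assigned to the reference tokens.

    token_probs is a hypothetical stand-in for the probabilities the ILM
    gives to each token of an unpaired text sentence.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def jeit_loss(e2e_loss, ilm_token_probs, ilm_weight):
    """Weighted sum of an E2E loss and an ILM cross-entropy loss.

    e2e_loss: loss already computed on an audio-transcript pair (assumed given).
    ilm_token_probs: ILM probabilities on unpaired text tokens (assumed given).
    ilm_weight: hypothetical scalar weight on the ILM term.
    """
    ilm_loss = cross_entropy(ilm_token_probs)
    return e2e_loss + ilm_weight * ilm_loss


# Example: an E2E loss of 2.0 combined with the ILM loss on two text tokens.
total = jeit_loss(2.0, [0.5, 0.25], ilm_weight=0.1)
```

In this sketch both terms would be backpropagated in the same training step, which corresponds to the abstract's point that no separate adaptation stage is needed.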

Adaptation models

Computational modeling

internal LM

Signal processing

Speech recognition

text injection

Text recognition

Training

Transducers

Details

Title: JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition
Creators: Zhong Meng - Google
Weiran Wang - Google
Rohit Prabhavalkar - Google
Tara N. Sainath - Google
Tongzhou Chen - Google
Ehsan Variani - Google
Yu Zhang - Google
Bo Li - Google
Andrew Rosenberg - Google
Bhuvana Ramabhadran - Google
Resource Type: Conference proceeding
Publication Details: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol.2023-, pp.1-5
Publisher: IEEE
DOI: 10.1109/ICASSP49357.2023.10095249
ISSN: 1520-6149
eISSN: 2379-190X
Language: English
Date published: 06/04/2023
Academic Unit: Computer Science
Record Identifier: 9984696716402771
