Conference proceeding
JOIST: A Joint Speech and Text Streaming Model for ASR
2022 IEEE Spoken Language Technology Workshop (SLT), pp.52-59
01/09/2023
DOI: 10.1109/SLT54892.2023.10022774
Abstract
We present JOIST, an algorithm to train a streaming, cascaded-encoder end-to-end (E2E) model with both paired speech-text inputs and unpaired text-only inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data; both are novelties compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text subword unit representation. We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for a good user-level experience.
Details
- Title: Subtitle
- JOIST: A Joint Speech and Text Streaming Model for ASR
- Creators
- Tara N. Sainath - Google; Rohit Prabhavalkar - Google; Ankur Bapna - Google; Yu Zhang - Google; Zhouyuan Huo - Google; Zhehuai Chen - Google; Bo Li - Google; Weiran Wang - Google; Trevor Strohman - Google
- Resource Type
- Conference proceeding
- Publication Details
- 2022 IEEE Spoken Language Technology Workshop (SLT), pp.52-59
- Publisher
- IEEE
- DOI
- 10.1109/SLT54892.2023.10022774
- ISSN
- 2639-5479
- Language
- English
- Date published
- 01/09/2023
- Academic Unit
- Computer Science
- Record Identifier
- 9984696578502771