Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models

Rohit Prabhavalkar; Zhong Meng; Weiran Wang; Adam Stooke; Xingyu Cai; Yanzhang He; Arun Narayanan; Dongseong Hwang; Tara N. Sainath; Pedro J. Moreno

doi:10.1109/ICASSP48485.2024.10446985

Conference proceeding

Open access

Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai, Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath and Pedro J. Moreno

ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.11816-11820

04/14/2024

DOI: 10.1109/ICASSP48485.2024.10446985

Files and links (1)

https://doi.org/10.1109/ICASSP48485.2024.10446985

Published (Version of record) Open Access

Abstract

The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, require computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames. While similar techniques have been investigated in previous work, we achieve dramatically more reduction than has previously been demonstrated through the use of multiple funnel reduction layers. Through ablations, we study the impact of various architectural choices in the encoder to identify the most effective strategies. We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task, while improving encoder and decoder latencies by 48% and 92% respectively, relative to a strong but computationally expensive baseline.
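The frame-rate arithmetic behind the abstract can be illustrated with a small sketch. It is not the authors' implementation. The 40 ms base encoder frame duration, the six 2x reduction stages, and all function names are assumptions chosen for illustration; only the 2.56 sec figure comes from the abstract. Plain mean pooling stands in here for the funnel reduction layers, which is a deliberate simplification.

```python
import math


def output_frame_duration_ms(base_frame_ms, reduction_factors):
    """Speech duration covered by one output frame after stacked reductions."""
    total = 1
    for factor in reduction_factors:
        total *= factor
    return base_frame_ms * total


def num_output_frames(num_input_frames, reduction_factors):
    """Frame count after each reduction stage, rounding up at every stage."""
    n = num_input_frames
    for factor in reduction_factors:
        n = math.ceil(n / factor)
    return n


def mean_pool(frames, factor):
    """Average consecutive groups of `factor` frames along the time axis.

    A short final group is averaged over the frames it actually contains.
    This is a stand-in for a learned reduction layer, not funnel pooling itself.
    """
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


# Assumed setup: 40 ms base frames and six 2x stages (hypothetical values).
stages = [2] * 6
print(output_frame_duration_ms(40, stages))   # 2560 ms, i.e. 2.56 sec per output frame
print(num_output_frames(250, stages))         # a 10 sec utterance (250 frames) -> 4 frames
print(mean_pool([[1.0], [3.0], [5.0]], 2))    # [[2.0], [5.0]]
```

Under these assumptions, a 10 sec utterance leaves the encoder as only 4 frames. The decoder therefore attends over far fewer encoder outputs, which is consistent with the decoder showing the larger latency improvement reported in the abstract.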

Acoustics

Computational efficiency

computational latency

Computational modeling

Decoding

end-to-end ASR

Error analysis

large models

runtime efficiency

Signal processing

Task analysis

Details

Title: Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models
Creators: Rohit Prabhavalkar - Google (United States)
Zhong Meng - Google (United States)
Weiran Wang - Google (United States)
Adam Stooke - Google (United States)
Xingyu Cai - Google (United States)
Yanzhang He - Google (United States)
Arun Narayanan - Google (United States)
Dongseong Hwang - Google (United States)
Tara N. Sainath - Google (United States)
Pedro J. Moreno - Google (United States)
Resource Type: Conference proceeding
Publication Details: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.11816-11820
DOI: 10.1109/ICASSP48485.2024.10446985
eISSN: 2379-190X
Publisher: IEEE
Language: English
Date published: 04/14/2024
Academic Unit: Computer Science
Record Identifier: 9984696720002771
