Extreme Encoder Output Frame Rate Reduction: Improving Computational
Latencies of Large End-to-End Models
- URL: http://arxiv.org/abs/2402.17184v1
- Date: Tue, 27 Feb 2024 03:40:44 GMT
- Title: Extreme Encoder Output Frame Rate Reduction: Improving Computational
Latencies of Large End-to-End Models
- Authors: Rohit Prabhavalkar, Zhong Meng, Weiran Wang, Adam Stooke, Xingyu Cai,
Yanzhang He, Arun Narayanan, Dongseong Hwang, Tara N. Sainath, Pedro J.
Moreno
- Abstract summary: We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
- Score: 59.57732929473519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models
continues to improve as they are scaled to larger sizes, with some now reaching
billions of parameters. Widespread deployment and adoption of these models,
however, requires computationally efficient strategies for decoding. In the
present work, we study one such strategy: applying multiple frame reduction
layers in the encoder to compress encoder outputs into a small number of output
frames. While similar techniques have been investigated in previous work, we
achieve dramatically more reduction than has previously been demonstrated
through the use of multiple funnel reduction layers. Through ablations, we
study the impact of various architectural choices in the encoder to identify
the most effective strategies. We demonstrate that we can generate one encoder
output frame for every 2.56 sec of input speech, without significantly
affecting word error rate on a large-scale voice search task, while improving
encoder and decoder latencies by 48% and 92% respectively, relative to a strong
but computationally expensive baseline.
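The arithmetic behind the 2.56 sec figure can be sketched as follows. This is a minimal illustration, not the authors' implementation: plain average pooling stands in for the funnel reduction layers. The 40 ms input frame duration, the stride of 2, and the count of six reduction layers are assumptions chosen so that the numbers match the abstract; none of them is stated in the abstract itself. The names `reduce_frames` and `stacked_reduction` are hypothetical.

```python
import numpy as np


def reduce_frames(frames: np.ndarray, stride: int = 2) -> np.ndarray:
    """Average-pool groups of `stride` consecutive frames along the time axis.

    If the number of frames is not a multiple of `stride`, the last frame
    is repeated to fill the final group.
    """
    num_frames, dim = frames.shape
    pad = (-num_frames) % stride
    if pad:
        tail = np.repeat(frames[-1:], pad, axis=0)
        frames = np.concatenate([frames, tail], axis=0)
    return frames.reshape(-1, stride, dim).mean(axis=1)


def stacked_reduction(frames: np.ndarray, num_layers: int, stride: int = 2) -> np.ndarray:
    """Apply `num_layers` reduction steps in sequence."""
    for _ in range(num_layers):
        frames = reduce_frames(frames, stride)
    return frames


# Assumed values (not given in the abstract).
INPUT_FRAME_MS = 40   # duration covered by one frame entering the first reduction
NUM_LAYERS = 6        # number of stacked reduction layers
STRIDE = 2            # reduction factor per layer

# Each layer multiplies the duration covered by one output frame by STRIDE.
output_frame_ms = INPUT_FRAME_MS * STRIDE ** NUM_LAYERS

# A 10.24 sec utterance: 256 frames of 40 ms, with 8 features per frame.
utterance = np.random.default_rng(0).normal(size=(256, 8))
reduced = stacked_reduction(utterance, NUM_LAYERS, STRIDE)

print(output_frame_ms / 1000, "sec per encoder output frame")
print(utterance.shape, "->", reduced.shape)
```

Under these assumptions the total reduction factor is 2^6 = 64, so the decoder attends to 4 encoder frames instead of 256 for this utterance, which is where the decoder-side savings would come from.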
Related papers
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
- Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z)
- DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models [22.276574156358084]
We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
arXiv Detail & Related papers (2023-11-15T01:01:02Z)
- NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models [29.468888611690346]
We propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models.
Our findings highlight two insights: (1) the number of decoder layers is the dominant factor of inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality.
arXiv Detail & Related papers (2023-10-16T04:27:36Z)
- You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model [37.24203191658052]
Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture.
Performance improvements come with increasing model size, resulting in slow inference speed and increased cost for serving.
We propose a novel early exiting strategy for unified visual language models, which allows layers in the encoder and decoder to be skipped dynamically and simultaneously.
arXiv Detail & Related papers (2022-11-21T02:32:25Z)
- Reducing Redundancy in the Bottleneck Representation of the Autoencoders [98.78384185493624]
Autoencoders are a type of unsupervised neural network, which can be used to solve various tasks.
We propose a scheme to explicitly penalize feature redundancies in the bottleneck representation.
We tested our approach across different tasks: dimensionality reduction using three different datasets, image compression using the MNIST dataset, and image denoising using Fashion-MNIST.
arXiv Detail & Related papers (2022-02-09T18:48:02Z)
- Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity [10.098578160958946]
We show that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset.
Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
arXiv Detail & Related papers (2021-11-29T05:22:46Z)
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
arXiv Detail & Related papers (2021-07-30T04:50:56Z)