Practical Conformer: Optimizing size, speed and flops of Conformer for
on-Device and cloud ASR
- URL: http://arxiv.org/abs/2304.00171v1
- Date: Fri, 31 Mar 2023 23:30:48 GMT
- Title: Practical Conformer: Optimizing size, speed and flops of Conformer for
on-Device and cloud ASR
- Authors: Rami Botros, Anmol Gulati, Tara N. Sainath, Krzysztof Choromanski,
Ruoming Pang, Trevor Strohman, Weiran Wang, Jiahui Yu
- Abstract summary: We design an optimized conformer that is small enough to meet on-device restrictions and has fast inference on TPUs.
Our proposed encoder can double as a strong standalone on-device encoder and as the first part of a high-performance ASR pipeline.
- Score: 67.63332492134332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conformer models maintain a large number of internal states, the vast
majority of which are associated with self-attention layers. With limited
memory bandwidth, reading these from memory at each inference step can slow
down inference. In this paper, we design an optimized conformer that is small
enough to meet on-device restrictions and has fast inference on TPUs. We
explore various ideas to improve the execution speed, including replacing lower
conformer blocks with convolution-only blocks, strategically downsizing the
architecture, and utilizing an RNNAttention-Performer. Our optimized conformer
can be readily incorporated into a cascaded-encoder setting, allowing a
second-pass decoder to operate on its output and improve the accuracy whenever
more resources are available. Altogether, we find that these optimizations can
reduce latency by a factor of 6.8x, and come at a reasonable trade-off in
quality. With the cascaded second-pass, we show that the recognition accuracy
is completely recoverable. Thus, our proposed encoder can double as a strong
standalone on-device encoder and as the first part of a high-performance
ASR pipeline.
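The abstract's point about attention state can be illustrated with a small sketch. This is not the paper's RNNAttention-Performer; it is a generic causal kernelized (linear) attention written as a recurrence, under the assumption that a simple positive feature map (`elu(x) + 1`) stands in for Performer's random features. All function names here are hypothetical. The recurrent form keeps a fixed-size state (`S` and `z`) instead of a cache of past keys and values that grows with each step, and it produces the same outputs as the cached form.

```python
import math


def feature_map(x):
    # Assumed positive feature map, elu(x) + 1. Performer itself uses
    # random features; this is a simpler stand-in for illustration.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def recurrent_linear_attention(queries, keys, values):
    """Causal linear attention with a fixed-size running state."""
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    z = [0.0] * d_k                        # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            z[i] += phi_k[i]
            for j in range(d_v):
                S[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        norm = dot(phi_q, z)  # positive, since all features are positive
        outputs.append(
            [sum(phi_q[i] * S[i][j] for i in range(d_k)) / norm
             for j in range(d_v)]
        )
    return outputs


def cached_linear_attention(queries, keys, values):
    """Same attention, but re-reading every past key and value per step."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [dot(phi_q, feature_map(k)) for k in keys[: t + 1]]
        total = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values[: t + 1])) / total
             for j in range(d_v)]
        )
    return outputs


# Toy inputs: 3 time steps, key dimension 2, value dimension 3.
queries = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.9]]
keys = [[0.3, 0.7], [-0.6, 0.2], [0.8, -0.1]]
values = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [0.0, 3.0, 1.0]]

recurrent_out = recurrent_linear_attention(queries, keys, values)
cached_out = cached_linear_attention(queries, keys, values)
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(recurrent_out, cached_out)
    for a, b in zip(row_a, row_b)
)
print("max difference between the two forms:", max_diff)
```

In the recurrent form, the memory read at each step is `d_k * d_v + d_k` numbers regardless of how many frames have been processed, whereas the cached form reads all `t + 1` past keys and values at step `t`.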
Related papers
- SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves a speedup of 1.79x for EfficientDet and 4.72x for CRNN, with a minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational
Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Coding for Gaussian Two-Way Channels: Linear and Learning-Based
Approaches [28.98777190628006]
We propose two different two-way coding strategies: linear coding and learning-based coding.
For learning-based coding, we introduce a novel recurrent neural network (RNN)-based coding architecture.
Our two-way coding methodologies outperform conventional channel coding schemes significantly in sum-error performance.
arXiv Detail & Related papers (2023-12-31T12:40:18Z)
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
- Quick Dense Retrievers Consume KALE: Post Training Kullback Leibler
Alignment of Embeddings for Asymmetrical dual encoders [89.29256833403169]
We introduce Kullback Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods.
KALE extends traditional Knowledge Distillation after bi-encoder training, allowing for effective query encoder compression without full retraining or index generation.
Using KALE and asymmetric training, we can generate models which exceed the performance of DistilBERT despite having 3x faster inference.
arXiv Detail & Related papers (2023-03-31T15:44:13Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time
Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z)
- Easy and Efficient Transformer : Scalable Inference Solution For large
NLP mode [14.321889138798072]
This paper introduces a series of ultra-large-scale pre-training model optimization methods.
An inference engine -- Easy and Efficient Transformer (EET) is proposed.
EET achieves a 1.5-15x speedup over the state of the art, varying with context length.
arXiv Detail & Related papers (2021-04-26T11:00:56Z)
- Communication-Efficient Gradient Coding for Straggler Mitigation in
Distributed Learning [17.454251607446555]
Distributed implementations of gradient-based methods, wherein a server distributes gradient computations across worker machines, need to overcome two limitations.
Ye and Abbe [ICML 2018] proposed a coding-theoretic paradigm to characterize a fundamental trade-off between computation load per worker, communication overhead per worker, and straggler tolerance.
We develop a communication-efficient gradient coding framework to overcome these drawbacks.
arXiv Detail & Related papers (2020-05-14T17:57:13Z)