Efficient End-to-End Speech Recognition Using Performers in Conformers
- URL: http://arxiv.org/abs/2011.04196v2
- Date: Wed, 11 Nov 2020 02:07:46 GMT
- Title: Efficient End-to-End Speech Recognition Using Performers in Conformers
- Authors: Peidong Wang, DeLiang Wang
- Abstract summary: We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
- Score: 74.71219757585841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-device end-to-end speech recognition poses a high requirement on model
efficiency. Most prior works improve the efficiency by reducing model sizes. We
propose to reduce the complexity of model architectures in addition to model
sizes. More specifically, we reduce the floating-point operations in conformer
by replacing the transformer module with a performer. The proposed
attention-based efficient end-to-end speech recognition model yields
competitive performance on the LibriSpeech corpus with 10 million
parameters and linear computational complexity. The proposed model also
outperforms previous lightweight end-to-end models by about 20% relative in
word error rate.
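The abstract attributes the linear complexity to swapping softmax self-attention for a performer. As a rough illustration only, and not the authors' implementation, the sketch below uses positive random features in the style of the Performer and shows that summarizing keys and values once gives the same output as building the full T x T attention matrix. All function names, sizes, and the seed are assumptions made for this example.

```python
import math
import random


def positive_features(x, projections):
    """Map a vector to positive random features exp(w.x - |x|^2 / 2) / sqrt(m)."""
    half_sq_norm = sum(v * v for v in x) / 2.0
    scale = 1.0 / math.sqrt(len(projections))
    return [
        scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm)
        for w in projections
    ]


def quadratic_attention(q_feats, k_feats, values):
    """Reference path: form every query-key weight, so cost grows with T * T."""
    dim = len(values[0])
    out = []
    for q in q_feats:
        weights = [sum(a * b for a, b in zip(q, k)) for k in k_feats]
        total = sum(weights)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]
        )
    return out


def linear_attention(q_feats, k_feats, values):
    """Linear path: accumulate key/value summaries once, so cost grows with T."""
    m = len(k_feats[0])
    dim = len(values[0])
    kv = [[0.0] * dim for _ in range(m)]  # sum over t of phi(k_t) v_t^T  (m x d)
    k_sum = [0.0] * m                     # sum over t of phi(k_t)        (m)
    for k, v in zip(k_feats, values):
        for i in range(m):
            k_sum[i] += k[i]
            for j in range(dim):
                kv[i][j] += k[i] * v[j]
    out = []
    for q in q_feats:
        denom = sum(q[i] * k_sum[i] for i in range(m))
        out.append([sum(q[i] * kv[i][j] for i in range(m)) / denom for j in range(dim)])
    return out


def demo(seq_len=6, dim=4, num_features=8, seed=0):
    """Run both paths on small random inputs (hypothetical sizes) and return them."""
    rng = random.Random(seed)
    projections = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_features)]
    queries = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(seq_len)]
    keys = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(seq_len)]
    values = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
    q_feats = [positive_features(q, projections) for q in queries]
    k_feats = [positive_features(k, projections) for k in keys]
    slow = quadratic_attention(q_feats, k_feats, values)
    fast = linear_attention(q_feats, k_feats, values)
    return slow, fast


if __name__ == "__main__":
    slow, fast = demo()
    gap = max(abs(a - b) for rs, rf in zip(slow, fast) for a, b in zip(rs, rf))
    print("largest difference between the two paths:", gap)
```

The two paths differ only in the order of multiplication. The linear path never materializes the T x T matrix, which is the property the abstract relies on. How closely random features approximate true softmax attention depends on the number of features and is not measured here.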
Related papers
- ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model [9.1108256816605]
We propose a method to improve model representation and processing efficiency by replacing the tokenizers of large language models (LLMs).
Our method can maintain the performance of the model after replacing the tokenizer, while significantly improving the decoding speed for long texts.
arXiv Detail & Related papers (2024-10-06T03:01:07Z)
- Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences [7.592727209806414]
Several ASR models exist in various sizes, with different inference costs leading to different performance levels.
We propose to train a decision module that, given an audio sample, selects the smallest model sufficient for a good transcription.
By keeping the decision process computationally efficient, we build a decision module that allows substantial computational savings with reduced performance drops.
arXiv Detail & Related papers (2023-09-22T08:50:58Z)
- E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z)
- Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization.
We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
- Highly Efficient Salient Object Detection with 100K Parameters [137.74898755102387]
We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features.
We build an extremely lightweight model, namely CSNet, which achieves performance comparable to large models with about 0.2% of their parameters (100k) on popular salient object detection benchmarks.
arXiv Detail & Related papers (2020-03-12T07:00:46Z)