Efficient conformer: Progressive downsampling and grouped attention for
automatic speech recognition
- URL: http://arxiv.org/abs/2109.01163v2
- Date: Wed, 8 Sep 2021 11:47:52 GMT
- Title: Efficient conformer: Progressive downsampling and grouped attention for
automatic speech recognition
- Authors: Maxime Burchi, Valentin Vielzeuf
- Abstract summary: We study how to reduce the Conformer architecture complexity with a limited computing budget.
We introduce progressive downsampling to the Conformer encoder and propose a novel attention mechanism named grouped attention.
Within the same computing budget, the proposed architecture achieves better performances with faster training and decoding.
- Score: 2.6346614942667235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed Conformer architecture has shown state-of-the-art
performances in Automatic Speech Recognition by combining convolution with
attention to model both local and global dependencies. In this paper, we study
how to reduce the Conformer architecture complexity with a limited computing
budget, leading to a more efficient architecture design that we call Efficient
Conformer. We introduce progressive downsampling to the Conformer encoder and
propose a novel attention mechanism named grouped attention, allowing us to
reduce attention complexity from $O(n^{2}d)$ to $O(n^{2}d / g)$ for sequence
length $n$, hidden dimension $d$ and group size parameter $g$. We also
experiment with the use of strided multi-head self-attention as a global
downsampling operation. Our experiments are performed on the LibriSpeech
dataset with CTC and RNN-Transducer losses. We show that within the same
computing budget, the proposed architecture achieves better performances with
faster training and decoding compared to the Conformer. Our 13M-parameter CTC
model achieves competitive WERs of 3.6%/9.0% without using a language model and
2.7%/6.7% with an external n-gram language model on the test-clean/test-other
sets while being 29% faster than our CTC Conformer baseline at inference and
36% faster to train.
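To make the stated complexity reduction concrete, the sketch below illustrates one way grouped multi-head self-attention can be realized: concatenating $g$ consecutive frames along the feature axis before the attention dot products, so attention runs over $n/g$ positions of dimension $dg$, giving $O((n/g)^{2} \cdot dg) = O(n^{2}d/g)$. This is a minimal, hypothetical PyTorch sketch based only on the abstract; the class name, shapes, and use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedSelfAttention(nn.Module):
    """Hypothetical grouped multi-head self-attention sketch.

    Assumption: g consecutive frames are concatenated along the feature
    axis before attention, so the dot products run over n/g positions of
    dimension d*g, i.e. O((n/g)^2 * d*g) = O(n^2 * d / g).
    """

    def __init__(self, d_model: int, num_heads: int, group_size: int):
        super().__init__()
        self.g = group_size
        self.mha = nn.MultiheadAttention(
            embed_dim=d_model * group_size,  # attention on grouped frames
            num_heads=num_heads,
            batch_first=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model); pad the time axis to a multiple of g.
        b, n, d = x.shape
        pad = (-n) % self.g
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        # Fold g consecutive frames into one grouped position.
        xg = x.reshape(b, (n + pad) // self.g, d * self.g)
        out, _ = self.mha(xg, xg, xg)
        # Unfold back to the original frame rate and drop the padding.
        return out.reshape(b, n + pad, d)[:, :n]


if __name__ == "__main__":
    layer = GroupedSelfAttention(d_model=144, num_heads=4, group_size=3)
    frames = torch.randn(2, 100, 144)   # (batch, time, features)
    print(layer(frames).shape)          # torch.Size([2, 100, 144])
```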
Related papers
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End
Speech Recognition [3.3627327936627416]
The work proposes a new Uconv-Conformer architecture based on the standard Conformer model.
We use upsampling blocks similar to the U-Net architecture to ensure the correct CTC loss calculation and stabilize network training.
The Uconv-Conformer architecture is not only faster in training and inference but also shows a better WER than the baseline Conformer.
arXiv Detail & Related papers (2022-08-16T10:40:15Z) - Matching Pursuit Based Scheduling for Over-the-Air Federated Learning [67.59503935237676]
This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning via the method of matching pursuit.
Compared to the state-of-the-art scheme, the proposed scheme has drastically lower computational complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
arXiv Detail & Related papers (2022-06-14T08:14:14Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, SRU++ can surpass Conformer on long-form speech input by a large margin, based on our analysis.
arXiv Detail & Related papers (2021-10-11T19:23:50Z) - Improved Mask-CTC for Non-Autoregressive End-to-End ASR [49.192579824582694]
This work improves the recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC).
We propose to enhance the network architecture by employing a recently proposed architecture called Conformer.
Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence.
arXiv Detail & Related papers (2020-10-26T01:22:35Z) - Generating Efficient DNN-Ensembles with Evolutionary Computation [3.28217012194635]
We leverage ensemble learning as a tool for the creation of faster, smaller, and more accurate deep learning models.
We run EARN on 10 image classification datasets with an initial pool of 32 state-of-the-art DCNNs on both CPU and GPU platforms.
We generate models with speedups of up to $7.60\times$, parameter reductions of $10\times$, or accuracy increases of up to $6.01\%$ with respect to the best DNN in the pool.
arXiv Detail & Related papers (2020-09-18T09:14:56Z) - AANet: Adaptive Aggregation Network for Efficient Stereo Matching [33.39794232337985]
Current state-of-the-art stereo models are mostly based on costly 3D convolutions.
We propose a sparse points based intra-scale cost aggregation method to alleviate the edge-fattening issue.
We also approximate traditional cross-scale cost aggregation algorithm with neural network layers to handle large textureless regions.
arXiv Detail & Related papers (2020-04-20T18:07:55Z)