Efficient conformer: Progressive downsampling and grouped attention for
automatic speech recognition
- URL: http://arxiv.org/abs/2109.01163v2
- Date: Wed, 8 Sep 2021 11:47:52 GMT
- Title: Efficient conformer: Progressive downsampling and grouped attention for
automatic speech recognition
- Authors: Maxime Burchi, Valentin Vielzeuf
- Abstract summary: We study how to reduce the Conformer architecture complexity with a limited computing budget.
We introduce progressive downsampling to the Conformer encoder and propose a novel attention mechanism named grouped attention.
Within the same computing budget, the proposed architecture achieves better performances with faster training and decoding.
- Score: 2.6346614942667235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed Conformer architecture has shown state-of-the-art
performances in Automatic Speech Recognition by combining convolution with
attention to model both local and global dependencies. In this paper, we study
how to reduce the Conformer architecture complexity with a limited computing
budget, leading to a more efficient architecture design that we call Efficient
Conformer. We introduce progressive downsampling to the Conformer encoder and
propose a novel attention mechanism named grouped attention, allowing us to
reduce attention complexity from $O(n^{2}d)$ to $O(n^{2}d / g)$ for sequence
length $n$, hidden dimension $d$ and group size parameter $g$. We also
experiment with the use of strided multi-head self-attention as a global
downsampling operation. Our experiments are performed on the LibriSpeech
dataset with CTC and RNN-Transducer losses. We show that within the same
computing budget, the proposed architecture achieves better performances with
faster training and decoding compared to the Conformer. Our 13M-parameter CTC
model achieves competitive WERs of 3.6%/9.0% without using a language model and
2.7%/6.7% with an external n-gram language model on the test-clean/test-other
sets while being 29% faster than our CTC Conformer baseline at inference and
36% faster to train.
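To make the stated complexity reduction concrete, the sketch below illustrates one way grouped multi-head self-attention can be realized: concatenating $g$ consecutive frames along the feature axis before the attention dot products, so attention runs over $n/g$ positions of dimension $dg$, giving $O((n/g)^{2} \cdot dg) = O(n^{2}d/g)$. This is a minimal, hypothetical PyTorch sketch based only on the abstract; the class name, shapes, and use of `nn.MultiheadAttention` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedSelfAttention(nn.Module):
    """Hypothetical grouped multi-head self-attention sketch.

    Assumption: g consecutive frames are concatenated along the feature
    axis before attention, so the dot products run over n/g positions of
    dimension d*g, i.e. O((n/g)^2 * d*g) = O(n^2 * d / g).
    """

    def __init__(self, d_model: int, num_heads: int, group_size: int):
        super().__init__()
        self.g = group_size
        self.mha = nn.MultiheadAttention(
            embed_dim=d_model * group_size,  # attention on grouped frames
            num_heads=num_heads,
            batch_first=True,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model); pad the time axis to a multiple of g.
        b, n, d = x.shape
        pad = (-n) % self.g
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
        # Fold g consecutive frames into one grouped position.
        xg = x.reshape(b, (n + pad) // self.g, d * self.g)
        out, _ = self.mha(xg, xg, xg)
        # Unfold back to the original frame rate and drop the padding.
        return out.reshape(b, n + pad, d)[:, :n]


if __name__ == "__main__":
    layer = GroupedSelfAttention(d_model=144, num_heads=4, group_size=3)
    frames = torch.randn(2, 100, 144)   # (batch, time, features)
    print(layer(frames).shape)          # torch.Size([2, 100, 144])
```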
Related papers
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End
Speech Recognition [3.3627327936627416]
The work proposes a new Uconv-Conformer architecture based on the standard Conformer model.
We use upsampling blocks similar to the U-Net architecture to ensure the correct CTC loss calculation and stabilize network training.
The Uconv-Conformer architecture is not only faster in training and inference but also shows a better WER than the baseline Conformer.
arXiv Detail & Related papers (2022-08-16T10:40:15Z) - Matching Pursuit Based Scheduling for Over-the-Air Federated Learning [67.59503935237676]
This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning via the method of matching pursuit.
Compared to the state-of-the-art scheme, the proposed scheme has drastically lower computational complexity.
The efficiency of the proposed scheme is confirmed via experiments on the CIFAR dataset.
arXiv Detail & Related papers (2022-06-14T08:14:14Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, SRU++ can surpass Conformer on long-form speech input by a large margin, based on our analysis.
arXiv Detail & Related papers (2021-10-11T19:23:50Z) - Improved Mask-CTC for Non-Autoregressive End-to-End ASR [49.192579824582694]
This work improves the recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC).
We propose to enhance the network architecture by employing a recently proposed architecture called Conformer.
Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence.
arXiv Detail & Related papers (2020-10-26T01:22:35Z) - Generating Efficient DNN-Ensembles with Evolutionary Computation [3.28217012194635]
We leverage ensemble learning as a tool for the creation of faster, smaller, and more accurate deep learning models.
We run EARN on 10 image classification datasets with an initial pool of 32 state-of-the-art DCNNs on both CPU and GPU platforms.
We generate models with speedups of up to $7.60\times$, parameter reductions of $10\times$, or accuracy increases of up to $6.01\%$ with respect to the best DNN in the pool.
arXiv Detail & Related papers (2020-09-18T09:14:56Z) - AANet: Adaptive Aggregation Network for Efficient Stereo Matching [33.39794232337985]
Current state-of-the-art stereo models are mostly based on costly 3D convolutions.
We propose a sparse points based intra-scale cost aggregation method to alleviate the edge-fattening issue.
We also approximate traditional cross-scale cost aggregation algorithm with neural network layers to handle large textureless regions.
arXiv Detail & Related papers (2020-04-20T18:07:55Z)