Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2206.00888v1
- Date: Thu, 2 Jun 2022 06:06:29 GMT
- Title: Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
- Authors: Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya
Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer
- Abstract summary: Conformer is the de facto backbone model for various downstream speech tasks thanks to its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
- Score: 99.349598600887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed Conformer model has become the de facto backbone model
for various downstream speech tasks based on its hybrid attention-convolution
architecture that captures both local and global features. However, through a
series of systematic studies, we find that the Conformer architecture's design
choices are not optimal. After reexamining the design choices for both the
macro and micro-architecture of Conformer, we propose the Squeezeformer model,
which consistently outperforms the state-of-the-art ASR models under the same
training schemes. In particular, for the macro-architecture, Squeezeformer
incorporates (i) the Temporal U-Net structure, which reduces the cost of the
multi-head attention modules on long sequences, and (ii) a simpler block
structure of multi-head attention or convolution modules, each followed by a
feed-forward module, instead of the Macaron structure proposed in Conformer.
Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the
activations in the convolutional block, (ii) removes redundant Layer
Normalization operations, and (iii) incorporates an efficient depth-wise
downsampling layer to efficiently sub-sample the input signal. Squeezeformer
achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER)
on LibriSpeech test-other without external language models. This is 3.1%, 1.4%,
and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is
open-sourced and available online.
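To make the macro-architecture description above concrete, the following is a minimal PyTorch sketch of the Temporal U-Net idea: sub-sample the sequence with a depthwise-separable convolution before the deeper blocks, run the remaining blocks at the reduced rate, then upsample and add a skip connection back to the full-rate features. All names here (DepthwiseDownsample, TemporalUNetEncoder, the stride of 2, nearest-neighbor upsampling) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DepthwiseDownsample(nn.Module):
    """Depthwise-separable 1D convolution that halves the temporal length."""

    def __init__(self, dim: int, stride: int = 2):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=3, stride=stride,
                                   padding=1, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> (batch, ~time/stride, dim)
        x = x.transpose(1, 2)
        x = self.pointwise(self.depthwise(x))
        return x.transpose(1, 2)


class TemporalUNetEncoder(nn.Module):
    """Run some blocks at the full frame rate, downsample, run the rest at
    the reduced rate, then upsample and add a skip from the full-rate path."""

    def __init__(self, blocks_full, blocks_reduced, dim: int):
        super().__init__()
        self.blocks_full = nn.ModuleList(blocks_full)
        self.downsample = DepthwiseDownsample(dim)
        self.blocks_reduced = nn.ModuleList(blocks_reduced)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks_full:
            x = block(x)
        skip = x                                    # full-rate features
        x = self.downsample(x)                      # attention cost drops ~4x
        for block in self.blocks_reduced:
            x = block(x)
        x = self.upsample(x.transpose(1, 2)).transpose(1, 2)
        x = x[:, : skip.size(1), :]                 # trim to the skip length
        return x + skip                             # restore temporal detail
```

Running most attention modules at half the temporal rate is what cuts their quadratic cost on long utterances; the skip connection restores the full resolution before the output head.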
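The revised block structure can likewise be sketched as a plain Transformer-style ordering in which each multi-head attention or convolution module is followed by a full feed-forward module, with residual connections and post-layer-normalization rather than Conformer's Macaron half-step feed-forwards. This is a simplified sketch under stated assumptions (a single Swish activation in the convolution module, no relative positional encoding, standard nn.MultiheadAttention), not the reference code.

```python
import torch.nn as nn


class ConvModule(nn.Module):
    """Convolution module with a single Swish activation (simplified)."""

    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        self.pointwise_in = nn.Conv1d(dim, dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()                        # Swish
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):
        # x: (batch, time, dim)
        x = x.transpose(1, 2)
        x = self.pointwise_in(x)
        x = self.act(self.norm(self.depthwise(x)))
        x = self.pointwise_out(x)
        return x.transpose(1, 2)


class AttentionConvFFNBlock(nn.Module):
    """Attention -> FFN -> Conv -> FFN, each with residual + post-LayerNorm."""

    def __init__(self, dim: int, heads: int = 4, ffn_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.ffn1 = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.SiLU(),
                                  nn.Linear(ffn_mult * dim, dim))
        self.ffn2 = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.SiLU(),
                                  nn.Linear(ffn_mult * dim, dim))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, x):
        x = self.norms[0](x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norms[1](x + self.ffn1(x))
        x = self.norms[2](x + self.conv(x))
        x = self.norms[3](x + self.ffn2(x))
        return x
```

A full encoder would stack such blocks inside a temporal-downsampling wrapper like the one in the previous sketch.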
Related papers
- Multi-Convformer: Extending Conformer with Multiple Convolution Kernels [64.4442240213399]
We introduce Multi-Convformer that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating.
Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient.
We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
arXiv Detail & Related papers (2024-07-04T08:08:12Z)
- Sparse Modular Activation for Efficient Sequence Modeling [94.11125833685583]
Recent models combining Linear State Space Models with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks.
Current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs.
We introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely activate sub-modules for sequence elements in a differentiable manner.
arXiv Detail & Related papers (2023-06-19T23:10:02Z)
- Systematic Architectural Design of Scale Transformed Attention Condenser DNNs via Multi-Scale Class Representational Response Similarity Analysis [93.0013343535411]
We propose a novel type of analysis called Multi-Scale Class Representational Response Similarity Analysis (ClassRepSim).
We show that adding STAC modules to ResNet-style architectures can result in up to a 1.6% increase in top-1 accuracy.
Results from ClassRepSim analysis can be used to select an effective parameterization of the STAC module resulting in competitive performance.
arXiv Detail & Related papers (2023-06-16T18:29:26Z)
- 4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders [29.799797974513552]
This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict.
The four decoders are jointly trained so that they can be easily switched depending on the application scenarios.
The experimental results showed that the proposed model consistently reduced the WER.
arXiv Detail & Related papers (2022-12-21T07:15:59Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that combines the detailed spatial information captured by CNNs with the global context provided by Transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- Local Similarity Pattern and Cost Self-Reassembling for Deep Stereo Matching Networks [3.7384509727711923]
We introduce a pairwise feature for deep stereo matching networks, named LSP (Local Similarity Pattern).
Through explicitly revealing the neighbor relationships, LSP contains rich structural information, which can be leveraged for more discriminative feature description.
Secondly, we design a dynamic self-reassembling refinement strategy and apply it to the cost distribution and the disparity map respectively.
arXiv Detail & Related papers (2021-12-02T06:52:54Z)
- Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition [2.6346614942667235]
We study how to reduce the Conformer architecture complexity with a limited computing budget.
We introduce progressive downsampling to the Conformer encoder and propose a novel attention mechanism named grouped attention.
Within the same computing budget, the proposed architecture achieves better performances with faster training and decoding.
arXiv Detail & Related papers (2021-08-31T07:48:06Z)
- X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operation.
The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z)