Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
- URL: http://arxiv.org/abs/2407.03718v2
- Date: Wed, 24 Jul 2024 02:03:47 GMT
- Title: Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
- Authors: Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe
- Abstract summary: We introduce Multi-Convformer that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating.
Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient.
We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
- Score: 64.4442240213399
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. Towards this, we introduce Multi-Convformer, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This helps in improved modelling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
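To make the core idea concrete, below is a minimal PyTorch-style sketch of a Conformer-style convolution module that runs several depthwise convolutions with different kernel sizes in parallel and mixes their outputs with a learned, per-frame gate. The kernel sizes, layer names, and the softmax gating formulation are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: multiple depthwise convolution kernels combined
# via gating inside a Conformer-style convolution module. Kernel sizes,
# normalisation choices, and the gating formulation are assumptions here,
# not the exact Multi-Convformer specification.
import torch
import torch.nn as nn


class MultiKernelConvModule(nn.Module):
    def __init__(self, d_model: int, kernel_sizes=(7, 15, 31)):
        super().__init__()
        self.pointwise_in = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        # One depthwise convolution per kernel size, each capturing local
        # context at a different granularity.
        self.depthwise = nn.ModuleList(
            [nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model)
             for k in kernel_sizes]
        )
        # Per-frame gate that mixes the branch outputs (an assumption).
        self.gate = nn.Linear(d_model, len(kernel_sizes))
        self.norm = nn.BatchNorm1d(d_model)
        self.activation = nn.SiLU()
        self.pointwise_out = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        residual = x
        y = self.pointwise_in(x.transpose(1, 2))   # (B, 2D, T)
        y = self.glu(y)                            # (B, D, T)
        # Run every kernel branch and stack: (B, D, T, K)
        branches = torch.stack([conv(y) for conv in self.depthwise], dim=-1)
        # Softmax gate over the K branches, computed per time step: (B, T, K)
        weights = torch.softmax(self.gate(y.transpose(1, 2)), dim=-1)
        y = (branches * weights.unsqueeze(1)).sum(dim=-1)   # (B, D, T)
        y = self.pointwise_out(self.activation(self.norm(y)))
        return residual + y.transpose(1, 2)        # (B, T, D)


if __name__ == "__main__":
    module = MultiKernelConvModule(d_model=144)
    x = torch.randn(2, 50, 144)        # (batch, time, d_model)
    print(module(x).shape)             # torch.Size([2, 50, 144])
```

A softmax gate over kernel branches is just one simple way to realise "multiple kernels in conjunction with gating"; the paper's exact gating, normalisation, and kernel-size choices may differ.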
Related papers
- GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model [66.35608254724566]
State-space models (SSMs) have showcased effective performance in modeling long-range dependencies with subquadratic complexity.
However, pure SSM-based models still face challenges related to stability and achieving optimal performance on computer vision tasks.
Our paper addresses the challenges of scaling SSM-based models for computer vision, particularly the instability and inefficiency of large model sizes.
arXiv Detail & Related papers (2024-07-18T17:59:58Z) - Augmenting conformers with structured state-space sequence models for online speech recognition [41.444671189679994]
Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems.
In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4).
We performed systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions.
Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
arXiv Detail & Related papers (2023-09-15T17:14:17Z) - Learning Modulated Transformation in GANs [69.95217723100413]
We equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed the modulated transformation module (MTM).
MTM predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations.
It is noteworthy that towards human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation.
arXiv Detail & Related papers (2023-08-29T17:51:22Z) - A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks [45.01428297033315]
Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing.
Recently, a new encoder called E-Branchformer has outperformed Conformer in the ASR benchmark.
This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-18T16:00:48Z) - QuadConv: Quadrature-Based Convolutions with Applications to Non-Uniform PDE Data Compression [6.488002704957669]
We present a new convolution layer for deep learning architectures which we call QuadConv.
Our operator is developed explicitly for use on non-uniform, mesh-based data.
We show that QuadConv can match the performance of standard discrete convolutions on uniform grid data.
arXiv Detail & Related papers (2022-11-09T19:02:40Z) - Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding [41.928263518867816]
Conformer has proven to be effective in many speech processing tasks.
Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer.
arXiv Detail & Related papers (2022-07-06T21:08:10Z) - Squeezeformer: An Efficient Transformer for Automatic Speech Recognition [99.349598600887]
Conformer is the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture.
We propose the Squeezeformer model, which consistently outperforms the state-of-the-art ASR models under the same training schemes.
arXiv Detail & Related papers (2022-06-02T06:06:29Z) - OneDConv: Generalized Convolution For Transform-Invariant Representation [76.15687106423859]
We propose a novel generalized one-dimensional convolutional operator (OneDConv).
It dynamically transforms the convolution kernels based on the input features in a computationally and parametrically efficient manner.
It improves the robustness and generalization of convolution without sacrificing the performance on common images.
arXiv Detail & Related papers (2022-01-15T07:44:44Z) - nnFormer: Interleaved Transformer for Volumetric Segmentation [50.10441845967601]
We introduce nnFormer, a powerful segmentation model with an interleaved architecture based on an empirical combination of self-attention and convolution.
nnFormer achieves substantial improvements over previous transformer-based methods on two commonly used datasets, Synapse and ACDC.
arXiv Detail & Related papers (2021-09-07T17:08:24Z)