A Comparative Study on E-Branchformer vs Conformer in Speech
Recognition, Translation, and Understanding Tasks
- URL: http://arxiv.org/abs/2305.11073v1
- Date: Thu, 18 May 2023 16:00:48 GMT
- Authors: Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora,
William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe
- Abstract summary: Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing.
Recently, a new encoder called E-Branchformer has outperformed Conformer on the LibriSpeech ASR benchmark.
This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models.
- Score: 45.01428297033315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conformer, a convolution-augmented Transformer variant, has become the de
facto encoder architecture for speech processing due to its superior
performance in various tasks, including automatic speech recognition (ASR),
speech translation (ST) and spoken language understanding (SLU). Recently, a
new encoder called E-Branchformer has outperformed Conformer in the LibriSpeech
ASR benchmark, making it promising for more general speech applications. This
work compares E-Branchformer and Conformer through extensive experiments using
different types of end-to-end sequence-to-sequence models. Results demonstrate
that E-Branchformer achieves comparable or better performance than Conformer in
almost all evaluation sets across 15 ASR, 2 ST, and 3 SLU benchmarks, while
being more stable during training. We will release our training configurations
and pre-trained models for reproducibility, which can benefit the speech
community.
Related papers
- Multi-Convformer: Extending Conformer with Multiple Convolution Kernels [64.4442240213399]
We introduce Multi-Convformer that uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating.
Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient.
We empirically compare our approach with Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
(2024-07-04T08:08:12Z)
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis on building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at similar bit-rate.
(2024-07-03T20:51:41Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
(2022-12-02T18:58:51Z)
- E-Branchformer: Branchformer with Enhanced merging for speech recognition [46.14282465455242]
We propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules.
E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets without using any external training data.
(2022-09-30T20:22:15Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, SRU++ can surpass Conformer on long-form speech input by a large margin, based on our analysis.
(2021-10-11T19:23:50Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations in advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
(2021-10-09T15:06:09Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
(2021-02-12T18:00:08Z)
- Adapting Pretrained Transformer to Lattices for Spoken Language Understanding [39.50831917042577]
It is shown that encoding lattices as opposed to 1-best results generated by an automatic speech recognizer (ASR) boosts the performance of spoken language understanding (SLU).
This paper aims at adapting pretrained transformers to lattice inputs in order to perform understanding tasks specifically for spoken language.
(2020-11-02T07:14:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
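Several of the summaries above quote word error rates (WERs) and relative WER improvements. As a rough illustration of how such figures are commonly computed, the sketch below implements WER as word-level edit distance divided by reference length, plus a relative-improvement helper. The function names and the example sentences are hypothetical and are not taken from any of the listed papers or their toolkits; real evaluations typically also apply text normalization, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only two rows of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer (positive is better)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up sentences: one substitution and one deletion over six words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.2%}")
    # Made-up WER values, chosen only to show what "8% relative" means.
    print(f"Relative improvement: {relative_improvement(5.0, 4.6):.2%}")
```

Note that a relative improvement is measured against the baseline WER, so moving from 5.0% to 4.6% WER is an 8% relative gain but only a 0.4-point absolute gain.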