E-Branchformer: Branchformer with Enhanced merging for speech recognition
- URL: http://arxiv.org/abs/2210.00077v1
- Date: Fri, 30 Sep 2022 20:22:15 GMT
- Title: E-Branchformer: Branchformer with Enhanced merging for speech recognition
- Authors: Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe
- Abstract summary: We propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules.
E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets without using any external training data.
- Score: 46.14282465455242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conformer, combining convolution and self-attention sequentially to capture
both local and global information, has shown remarkable performance and is
currently regarded as the state-of-the-art for automatic speech recognition
(ASR). Several other studies have explored integrating convolution and
self-attention, but they have not matched Conformer's performance. The
recently introduced Branchformer achieves comparable performance to Conformer
by using dedicated branches of convolution and self-attention and merging local
and global context from each branch. In this paper, we propose E-Branchformer,
which enhances Branchformer by applying an effective merging method and
stacking additional point-wise modules. E-Branchformer sets new
state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech
test-clean and test-other sets without using any external training data.
Related papers
- reCSE: Portable Reshaping Features for Sentence Embedding in Self-supervised Contrastive Learning [1.4604134018640291]
We propose reCSE, a self-supervised contrastive learning sentence representation framework based on feature reshaping.
This framework is different from the current advanced models that use discrete data augmentation methods.
Our reCSE has achieved competitive performance in semantic similarity tasks.
arXiv Detail & Related papers (2024-08-09T09:56:30Z)
- One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks [45.01428297033315]
Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing.
Recently, a new encoder called E-Branchformer has outperformed Conformer in the ASR benchmark.
This work compares E-Branchformer and Conformer through extensive experiments using different types of end-to-end sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-18T16:00:48Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding [41.928263518867816]
Conformer has proven to be effective in many speech processing tasks.
Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer.
arXiv Detail & Related papers (2022-07-06T21:08:10Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer on long-form speech input by a large margin.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)