Relaxed Attention: A Simple Method to Boost Performance of End-to-End
Automatic Speech Recognition
- URL: http://arxiv.org/abs/2107.01275v1
- Date: Fri, 2 Jul 2021 21:01:17 GMT
- Title: Relaxed Attention: A Simple Method to Boost Performance of End-to-End
Automatic Speech Recognition
- Authors: Timo Lohrenz, Patrick Schwarz, Zhengyang Li, Tim Fingscheidt
- Abstract summary: We introduce the concept of relaxed attention, which is a gradual injection of a uniform distribution into the encoder-decoder attention weights during training.
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, attention-based encoder-decoder (AED) models have shown high
performance for end-to-end automatic speech recognition (ASR) across several
tasks. Addressing overconfidence in such models, in this paper we introduce the
concept of relaxed attention, which is a simple gradual injection of a uniform
distribution into the encoder-decoder attention weights during training that is
easily implemented with two lines of code. We investigate the effect of relaxed
attention across different AED model architectures and two prominent ASR tasks,
Wall Street Journal (WSJ) and LibriSpeech. We find that transformers trained
with relaxed attention outperform the standard baseline models consistently
during decoding with external language models. On WSJ, we set a new benchmark
for transformer-based end-to-end speech recognition with a word error rate of
3.65%, outperforming state of the art (4.20%) by 13.1% relative, while
introducing only a single hyperparameter. Upon acceptance, models will be
published on GitHub.
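To make the mechanism concrete, below is a minimal PyTorch sketch of the smoothing step the abstract describes; the function name, the signature, and the use of a single fixed coefficient `gamma` are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of relaxed attention, assuming a fixed relaxation
# coefficient `gamma` (the single added hyperparameter) applied only
# during training. Names and signature are hypothetical.
import torch
import torch.nn.functional as F

def relaxed_attention(scores: torch.Tensor, gamma: float, training: bool) -> torch.Tensor:
    """scores: (..., T) raw encoder-decoder attention logits over T encoder frames."""
    weights = F.softmax(scores, dim=-1)
    if training and gamma > 0.0:
        # The "two lines" from the abstract: blend the softmax weights
        # with a uniform distribution over the T encoder frames.
        uniform = torch.full_like(weights, 1.0 / weights.size(-1))
        weights = (1.0 - gamma) * weights + gamma * uniform
    return weights
```

At inference time (`training=False`) the plain softmax weights are returned unchanged, so decoding itself is unaffected; only the attention weights seen during training are relaxed toward uniformity.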
Related papers
- Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations [16.577870835480585]
We present a comprehensive analysis of building ASR systems with discrete codes.
We investigate different methods for training such as quantization schemes and time-domain vs spectral feature encodings.
We introduce a pipeline that outperforms Encodec at a similar bit rate.
arXiv Detail & Related papers (2024-07-03T20:51:41Z)
- Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models [84.8919069953397]
Self-TAught Recognizer (STAR) is an unsupervised adaptation framework for speech recognition systems.
We show that STAR achieves an average of 13.5% relative reduction in word error rate across 14 target domains.
STAR exhibits high data efficiency, requiring less than one hour of unlabeled data.
arXiv Detail & Related papers (2024-05-23T04:27:11Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding (CSE) network addresses limitations of SIMO models by aggregating cross-speaker representations.
The CSE network is integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, shifting the focus from the static background scene to the foreground objects (a sketch follows this entry).
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to jointly reconstruct the original frames and the corresponding pixel-level anomaly maps.
arXiv Detail & Related papers (2023-06-21T06:18:05Z)
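As a rough illustration of the motion-gradient weighting referenced above, here is a minimal sketch assuming tokens correspond to fixed-size patches and motion is approximated by the difference between consecutive frames; the names, the patch size, and the normalization are hypothetical and not taken from the paper.

```python
# Minimal sketch of motion-gradient token weighting: per-patch motion
# magnitudes from frame differencing, normalized into loss weights so
# foreground (high-motion) patches dominate the static background.
import torch
import torch.nn.functional as F

def motion_weights(prev_frame: torch.Tensor, frame: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """prev_frame, frame: (C, H, W) tensors. Returns (H//patch, W//patch) weights."""
    motion = (frame - prev_frame).abs().mean(dim=0, keepdim=True)    # (1, H, W)
    per_patch = F.avg_pool2d(motion.unsqueeze(0), patch).squeeze()   # (H/p, W/p)
    return per_patch / (per_patch.sum() + 1e-8)  # normalize to sum to 1

# Hypothetical usage: scale each token's reconstruction error by its weight.
# loss = (motion_weights(prev, cur).flatten() * per_token_mse).sum()
```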
- Relaxed Attention for Transformer Models [29.896876421216373]
In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights.
We show that relaxed attention provides regularization when applied to the self-attention layers in the encoder.
We demonstrate the benefit of relaxed attention across several tasks with clear improvement in combination with recent benchmark approaches.
arXiv Detail & Related papers (2022-09-20T14:10:28Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640 ms latency, which, to the best of our knowledge, is state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention in the encoder and triggered attention for the encoder-decoder attention mechanism (a mask sketch follows this entry).
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
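To illustrate the time-restricted self-attention mentioned in the last entry, here is a minimal sketch of the attention mask; the window sizes `left` and `right` and all names are hypothetical, chosen only to show the masking pattern rather than the paper's exact configuration.

```python
# Minimal sketch of a time-restricted self-attention mask: each query
# frame attends only to a limited window of past/future frames, which
# bounds latency for streaming ASR. Window sizes are hypothetical.
import torch

def time_restricted_mask(num_frames: int, left: int, right: int) -> torch.Tensor:
    """Boolean (num_frames, num_frames) mask; True = key frame may be attended."""
    idx = torch.arange(num_frames)
    rel = idx[None, :] - idx[:, None]  # rel[q, k] = k - q (relative offset)
    return (rel >= -left) & (rel <= right)

# Usage: disallowed logits are set to -inf before the softmax.
scores = torch.randn(1, 8, 100, 100)  # (batch, heads, T, T) self-attention logits
mask = time_restricted_mask(100, left=40, right=16)
scores = scores.masked_fill(~mask, float("-inf"))
```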