Relaxed Attention for Transformer Models
- URL: http://arxiv.org/abs/2209.09735v1
- Date: Tue, 20 Sep 2022 14:10:28 GMT
- Title: Relaxed Attention for Transformer Models
- Authors: Timo Lohrenz and Björn Möller and Zhengyang Li and Tim Fingscheidt
- Abstract summary: In this paper, we explore relaxed attention, a simple and easy-to-implement smoothing of the attention weights.
We show that relaxed attention provides regularization when applied to the self-attention layers in the encoder.
We demonstrate the benefit of relaxed attention across several tasks with clear improvement in combination with recent benchmark approaches.
- Score: 29.896876421216373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The powerful modeling capabilities of all-attention-based transformer
architectures often cause overfitting and, for natural language processing
tasks, lead to an implicitly learned internal language model in the
autoregressive transformer decoder that complicates the integration of external
language models. In this paper, we explore relaxed attention, a simple and
easy-to-implement smoothing of the attention weights, yielding a two-fold
improvement to the general transformer architecture: First, relaxed attention
provides regularization when applied to the self-attention layers in the
encoder. Second, we show that it naturally supports the integration of an
external language model as it suppresses the implicitly learned internal
language model by relaxing the cross attention in the decoder. We demonstrate
the benefit of relaxed attention across several tasks with clear improvement in
combination with recent benchmark approaches. Specifically, we exceed the
former state-of-the-art word error rate of 26.90% on the largest public
lip-reading benchmark, LRS3, with a word error rate of 26.31%, and we achieve a
top-performing BLEU score of 37.67 on the IWSLT14 (DE$\rightarrow$EN) machine
translation task without external language models and with virtually no
additional model parameters. Code and models will be made
publicly available.
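The smoothing itself is easy to write down. Below is a minimal sketch, assuming relaxed attention forms a convex combination of the standard softmax attention weights and a uniform distribution over the keys, controlled by a relaxation coefficient gamma; the function name, tensor shapes, and gamma value are illustrative and not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def relaxed_attention_weights(scores: torch.Tensor, gamma: float) -> torch.Tensor:
    """Smooth attention weights by blending them with a uniform distribution.

    scores: raw attention logits of shape (..., query_len, key_len).
    gamma:  relaxation coefficient in [0, 1]; gamma = 0 recovers standard attention.
    """
    attn = F.softmax(scores, dim=-1)                # standard attention weights
    key_len = attn.size(-1)
    uniform = torch.full_like(attn, 1.0 / key_len)  # uniform distribution over the keys
    return (1.0 - gamma) * attn + gamma * uniform   # convex combination, still sums to 1

# Illustrative usage on a batch of multi-head attention scores.
scores = torch.randn(2, 8, 16, 16)  # (batch, heads, queries, keys)
weights = relaxed_attention_weights(scores, gamma=0.1)
assert torch.allclose(weights.sum(dim=-1), torch.ones(2, 8, 16))
```

Per the abstract, the same smoothing serves two purposes: applied to the encoder self-attention weights it acts as a regularizer, and applied to the decoder cross-attention weights it suppresses the implicitly learned internal language model.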
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z) - On the Analysis of Cross-Lingual Prompt Tuning for Decoder-based
Multilingual Model [49.81429697921861]
We study the interaction between parameter-efficient fine-tuning (PEFT) and cross-lingual tasks in multilingual autoregressive models.
We show that prompt tuning is more effective in enhancing the performance of low-resource languages than fine-tuning.
arXiv Detail & Related papers (2023-11-14T00:43:33Z) - Multi-Head State Space Model for Speech Recognition [44.04124537862432]
State space models (SSMs) have recently shown promising results on small-scale sequence and language modelling tasks.
In this paper, we propose a multi-head state space (MH-SSM) architecture equipped with special gating mechanisms.
As a drop-in replacement for multi-head attention in transformer encoders, this new model significantly outperforms the transformer transducer on the LibriSpeech speech recognition corpus.
arXiv Detail & Related papers (2023-05-21T16:28:57Z) - Shapley Head Pruning: Identifying and Removing Interference in
Multilingual Transformers [54.4919139401528]
We show that it is possible to reduce interference by identifying and pruning language-specific parameters.
We show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction; a small head-masking sketch after this list illustrates the removal step.
arXiv Detail & Related papers (2022-10-11T18:11:37Z) - Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z) - Relaxed Attention: A Simple Method to Boost Performance of End-to-End
Automatic Speech Recognition [27.530537066239116]
We introduce the concept of relaxed attention, a gradual injection of a uniform distribution into the encoder-decoder attention weights during training; a possible injection schedule is sketched after this list.
We find that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models.
On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative.
arXiv Detail & Related papers (2021-07-02T21:01:17Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple, fixed, non-learnable attentive patterns; a toy example of such a pattern is sketched after this list.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z) - Attention Is All You Need [36.87735219227719]
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms; its scaled dot-product attention is sketched after this list for reference.
Experiments on two machine translation tasks show these models to be superior in quality.
arXiv Detail & Related papers (2017-06-12T17:57:34Z)
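Regarding the Shapley head-pruning entry above, the removal step amounts to masking the outputs of the identified heads in an otherwise fixed model. The sketch below shows only this masking; the head indices and tensor shapes are hypothetical, and the Shapley-value attribution that selects the heads is not shown.

```python
import torch

def mask_attention_heads(head_outputs: torch.Tensor, heads_to_prune: list[int]) -> torch.Tensor:
    """Zero out the per-head outputs of the heads selected for removal.

    head_outputs: (batch, num_heads, seq_len, head_dim) output of a multi-head attention layer.
    """
    mask = torch.ones(head_outputs.size(1))
    mask[heads_to_prune] = 0.0                    # drop the identified heads
    return head_outputs * mask.view(1, -1, 1, 1)  # broadcast over batch, positions, features

pruned = mask_attention_heads(torch.randn(2, 12, 10, 64), heads_to_prune=[3, 7])
```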
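For the relaxed-attention speech recognition entry, the summary describes a gradual injection of a uniform distribution into the encoder-decoder attention weights during training. One way such a schedule could look is a simple ramp of the relaxation coefficient over training steps; the linear ramp, the helper name, and the step counts are assumptions for illustration, not the paper's actual schedule.

```python
def relaxation_coefficient(step: int, ramp_steps: int, gamma_max: float) -> float:
    """Illustrative linear ramp from 0 (standard attention) up to gamma_max."""
    return gamma_max * min(1.0, step / max(1, ramp_steps))

# During training, the cross-attention weights would then be smoothed with a growing
# coefficient, e.g. relaxed_attention_weights(scores, relaxation_coefficient(step, 10_000, 0.25));
# the summary presents the injection as a training-time measure.
```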
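For the fixed encoder self-attention patterns, one concrete example of a non-learnable pattern is a head that always attends to the previous token. The sketch below builds such a pattern as a constant weight matrix; the particular pattern and shapes are chosen for illustration and are not necessarily those used in that paper.

```python
import torch

def previous_token_pattern(seq_len: int) -> torch.Tensor:
    """Fixed, non-learnable attention pattern: token i attends fully to token i-1
    (the first token attends to itself). Returns a (seq_len, seq_len) matrix."""
    pattern = torch.zeros(seq_len, seq_len)
    pattern[0, 0] = 1.0
    idx = torch.arange(1, seq_len)
    pattern[idx, idx - 1] = 1.0
    return pattern

values = torch.randn(6, 32)                   # (seq_len, d_model) token representations
context = previous_token_pattern(6) @ values  # apply the fixed pattern in place of learned weights
```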
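Finally, for the original Transformer entry, the mechanism it relies on is scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; a minimal reference version is given below for comparison with the relaxed variant at the top of this page.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Standard Transformer attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return F.softmax(scores, dim=-1) @ v

# Example: queries, keys, and values for a single head.
q, k, v = (torch.randn(2, 16, 64) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)  # shape (2, 16, 64)
```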