On the Locality of Attention in Direct Speech Translation
- URL: http://arxiv.org/abs/2204.09028v1
- Date: Tue, 19 Apr 2022 17:43:37 GMT
- Title: On the Locality of Attention in Direct Speech Translation
- Authors: Belen Alastruey, Javier Ferrando, Gerard I. Gállego and Marta R.
Costa-jussà
- Abstract summary: Transformers have achieved state-of-the-art results across multiple NLP tasks.
We discuss the usefulness of self-attention for Direct Speech Translation.
- Score: 0.1749935196721634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have achieved state-of-the-art results across multiple NLP
tasks. However, the complexity of the self-attention mechanism scales quadratically
with the sequence length, creating an obstacle for tasks involving long
sequences, such as those in the speech domain. In this paper, we discuss the usefulness
of self-attention for Direct Speech Translation. First, we analyze the
of self-attention for Direct Speech Translation. First, we analyze the
layer-wise token contributions in the self-attention of the encoder, unveiling
local diagonal patterns. To prove that some attention weights are avoidable, we
propose to substitute the standard self-attention with a local efficient one,
setting the amount of context used based on the results of the analysis. With
this approach, our model matches the baseline performance, and improves the
efficiency by skipping the computation of those weights that standard attention
discards.
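As a rough illustration of the kind of local self-attention the abstract describes, here is a minimal sketch using a banded mask over standard scaled dot-product attention; the window size, tensor dimensions, and single-head formulation are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def local_self_attention(x, w_q, w_k, w_v, window: int):
    """Self-attention where each position attends only to neighbours
    within +/- `window` positions (a banded attention mask)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (batch, seq, seq)

    # Banded mask: True where |i - j| > window, i.e. positions to discard.
    idx = torch.arange(x.size(1), device=x.device)
    too_far = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(too_far, float("-inf"))

    return F.softmax(scores, dim=-1) @ v

# Illustrative usage with arbitrary sizes (not the paper's setup).
x = torch.randn(2, 100, 64)
w = [torch.randn(64, 64) * 0.05 for _ in range(3)]
out = local_self_attention(x, *w, window=16)
print(out.shape)  # torch.Size([2, 100, 64])
```

Note that masking alone only reproduces the restriction; the efficiency gain the abstract mentions comes from computing only the entries inside the band, e.g. with a sliding-window attention kernel.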
Related papers
- DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based
Point-Level Consistency [12.881617910150688]
We propose a transformer framework for self-supervised learning called DenseDINO to learn dense visual representations.
Specifically, DenseDINO introduces extra input tokens, called reference tokens, to match point-level features with a positional prior.
Compared with vanilla DINO, our approach obtains competitive performance on ImageNet classification.
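As a loose illustration of the reference-token idea (the module name, sizes, and the point-embedding choice below are assumptions, not details from the paper), extra tokens carrying a positional prior for sampled points can simply be concatenated with the patch tokens before the transformer encoder.

```python
import torch
import torch.nn as nn

class ReferenceTokenFrontend(nn.Module):
    """Prepends learnable 'reference tokens' tied to sampled point
    coordinates to the usual patch-token sequence (illustrative only)."""

    def __init__(self, d_model: int, num_refs: int):
        super().__init__()
        self.ref_tokens = nn.Parameter(torch.zeros(num_refs, d_model))
        self.point_embed = nn.Linear(2, d_model)  # (x, y) in [0, 1] -> d_model

    def forward(self, patch_tokens, points):
        # patch_tokens: (batch, n_patches, d_model); points: (batch, num_refs, 2)
        refs = self.ref_tokens.unsqueeze(0) + self.point_embed(points)
        return torch.cat([refs, patch_tokens], dim=1)

frontend = ReferenceTokenFrontend(d_model=192, num_refs=4)
tokens = frontend(torch.randn(2, 196, 192), torch.rand(2, 4, 2))
print(tokens.shape)  # torch.Size([2, 200, 192])
```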
arXiv Detail & Related papers (2023-06-06T15:04:45Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Attention as a Guide for Simultaneous Speech Translation [15.860792612311277]
We propose an attention-based policy (EDAtt) for simultaneous speech translation (SimulST).
Its goal is to leverage the encoder-decoder attention scores to guide inference in real time.
Results on en->de and en->es show that the EDAtt policy achieves overall better results than the SimulST state of the art.
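A hedged sketch of how encoder-decoder attention scores could drive a wait/emit decision in SimulST follows; the threshold, the number of trailing frames, and the head aggregation are illustrative assumptions, not the exact EDAtt formulation.

```python
import torch

def should_emit(cross_attn: torch.Tensor, last_frames: int = 16,
                threshold: float = 0.3) -> bool:
    """Decide whether to emit the next target token or wait for more audio.

    cross_attn: (num_heads, src_len) attention of the candidate token over
    the encoder states received so far. If too much attention mass falls on
    the most recent frames, the prediction likely depends on audio that is
    still incomplete, so we wait.
    """
    attn = cross_attn.mean(dim=0)                 # average over heads
    recent_mass = attn[-last_frames:].sum().item()
    return recent_mass < threshold

# Toy usage: if attention is spread over earlier frames, it is safe to emit.
attn = torch.softmax(torch.randn(8, 120), dim=-1)
print(should_emit(attn))
```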
arXiv Detail & Related papers (2022-12-15T14:18:53Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
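One way to picture such a relative-location pretext task (the grid size, pooled-embedding inputs, and loss below are assumptions for illustration, not the paper's model): crop a query patch from one cell of a reference grid and train a classifier to predict which cell it came from.

```python
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    """Predicts which cell of a grid x grid reference layout a query patch
    embedding came from (toy pretext-task head)."""

    def __init__(self, d_model: int, grid: int = 7):
        super().__init__()
        self.classifier = nn.Linear(2 * d_model, grid * grid)

    def forward(self, query_emb, reference_emb):
        # query_emb, reference_emb: (batch, d_model) pooled embeddings.
        return self.classifier(torch.cat([query_emb, reference_emb], dim=-1))

head = RelativeLocationHead(d_model=256, grid=7)
logits = head(torch.randn(4, 256), torch.randn(4, 256))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 49, (4,)))
```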
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
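The "nowhere-to-attend" trick is concrete enough to sketch: a learnable null slot is appended to the keys and values so that non-matching queries can park their attention mass there. Dimensions and placement here are illustrative, not IA-Net's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithNullToken(nn.Module):
    """Cross-attention with one extra learnable key/value slot that acts
    as a 'no match' sink for unaligned queries (illustrative)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.null_k = nn.Parameter(torch.zeros(1, 1, d_model))
        self.null_v = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, q, k, v):
        b = q.size(0)
        k = torch.cat([self.null_k.expand(b, -1, -1), k], dim=1)
        v = torch.cat([self.null_v.expand(b, -1, -1), v], dim=1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v

attn = AttentionWithNullToken(d_model=128)
out = attn(torch.randn(2, 20, 128), torch.randn(2, 30, 128), torch.randn(2, 30, 128))
print(out.shape)  # torch.Size([2, 20, 128])
```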
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention variants in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
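Very roughly, the idea is to treat attention weights as random variables rather than point estimates. The sketch below adds reparameterized noise to the attention logits, which is only a generic stochastic-attention illustration, not the paper's hierarchical Bayesian construction.

```python
import torch
import torch.nn.functional as F

def stochastic_attention(q, k, v, log_sigma: float = -1.0):
    """Attention where the weights are sampled around the usual dot-product
    logits (one reparameterized Monte Carlo sample, illustrative only)."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    noise = torch.randn_like(logits) * torch.exp(torch.tensor(log_sigma))
    weights = F.softmax(logits + noise, dim=-1)
    return weights @ v

# Averaging several samples at test time gives a crude uncertainty estimate.
q, k, v = (torch.randn(2, 10, 64) for _ in range(3))
samples = torch.stack([stochastic_attention(q, k, v) for _ in range(8)])
mean, std = samples.mean(dim=0), samples.std(dim=0)
```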
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- Capturing Multi-Resolution Context by Dilated Self-Attention [58.69803243323346]
We propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention.
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
ASR results demonstrate substantial improvements over restricted self-attention alone, achieving results similar to full-sequence self-attention at a fraction of the computational cost.
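A hedged sketch of combining restricted attention with a coarse summary of distant context: nearby frames are attended at full resolution while the rest of the sequence is mean-pooled into blocks. Window and block sizes are illustrative, projections are omitted, and the paper's exact summarization mechanism may differ.

```python
import torch
import torch.nn.functional as F

def dilated_self_attention(x, window: int = 8, block: int = 16):
    """Each frame attends to its +/- `window` neighbours at full resolution
    plus mean-pooled summaries of `block`-sized chunks of the sequence."""
    b, t, d = x.shape

    # Coarse summaries of the whole sequence (distant context, low resolution).
    pad = (block - t % block) % block
    summaries = F.pad(x, (0, 0, 0, pad)).view(b, -1, block, d).mean(dim=2)

    q = x
    k = torch.cat([x, summaries], dim=1)
    v = k
    scores = q @ k.transpose(-2, -1) / d ** 0.5

    # Restrict frame-to-frame scores to a local band; summary keys stay visible.
    idx = torch.arange(t, device=x.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window
    mask = torch.cat([band, torch.ones(t, summaries.size(1), dtype=torch.bool,
                                       device=x.device)], dim=1)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

out = dilated_self_attention(torch.randn(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 64])
```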
arXiv Detail & Related papers (2021-04-07T02:04:18Z)
- Improving BERT with Syntax-aware Local Attention [14.70545694771721]
We propose a syntax-aware local attention, where the attention scopes are based on the distances in the syntactic structure.
We conduct experiments on various single-sentence benchmarks, including sentence classification and sequence labeling tasks.
Our model achieves better performance owing to more focused attention over syntactically relevant words.
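A sketch of the masking idea, assuming a dependency parse is available as a head array: attention is restricted to tokens within a chosen distance in the dependency tree. The distance computation and threshold are illustrative; the paper's exact scope definition may differ.

```python
import numpy as np
import torch
import torch.nn.functional as F

def tree_distances(heads):
    """All-pairs distances in a dependency tree given head indices
    (heads[i] = index of token i's head, -1 for the root)."""
    n = len(heads)
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0)
    for i, h in enumerate(heads):
        if h >= 0:
            dist[i, h] = dist[h, i] = 1
    for k in range(n):                    # Floyd-Warshall over sentence length
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist

def syntax_local_attention(q, k, v, heads, max_dist: int = 2):
    """Attend only to tokens within `max_dist` hops in the dependency tree."""
    allowed = torch.from_numpy(tree_distances(heads) <= max_dist)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# "The cat sat": The -> cat, cat -> sat, sat is the root.
heads = [1, 2, -1]
q = k = v = torch.randn(3, 16)
out = syntax_local_attention(q, k, v, heads)
```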
arXiv Detail & Related papers (2020-12-30T13:29:58Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be easily parallelized.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
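The hard-retrieval idea is simple enough to sketch: instead of a softmax-weighted sum over values, the decoder picks the single value with the highest attention score. This is an inference-time view only; training such a non-differentiable choice needs extra machinery (e.g. straight-through estimation or sampling), which is not shown.

```python
import torch

def hard_retrieval_attention(q, k, v):
    """Cross-attention that retrieves exactly one value per query:
    the one with the highest dot-product score (inference-time sketch)."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (tgt, src)
    best = scores.argmax(dim=-1)                           # index of top key
    return v[best]                                         # gather, no weighted sum

q = torch.randn(5, 64)    # decoder queries
k = torch.randn(12, 64)   # encoder keys
v = torch.randn(12, 64)   # encoder values
print(hard_retrieval_attention(q, k, v).shape)  # torch.Size([5, 64])
```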
arXiv Detail & Related papers (2020-09-30T13:18:57Z)