On the Locality of Attention in Direct Speech Translation
- URL: http://arxiv.org/abs/2204.09028v1
- Date: Tue, 19 Apr 2022 17:43:37 GMT
- Title: On the Locality of Attention in Direct Speech Translation
- Authors: Belen Alastruey, Javier Ferrando, Gerard I. Gállego and Marta R.
Costa-jussà
- Abstract summary: Transformers have achieved state-of-the-art results across multiple NLP tasks.
We discuss the usefulness of self-attention for Direct Speech Translation.
- Score: 0.1749935196721634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have achieved state-of-the-art results across multiple NLP
tasks. However, the complexity of the self-attention mechanism scales quadratically
with the sequence length, creating an obstacle for tasks involving long
sequences, such as those in the speech domain. In this paper, we discuss the usefulness
of self-attention for Direct Speech Translation. First, we analyze the
of self-attention for Direct Speech Translation. First, we analyze the
layer-wise token contributions in the self-attention of the encoder, unveiling
local diagonal patterns. To prove that some attention weights are avoidable, we
propose to substitute the standard self-attention with a local efficient one,
setting the amount of context used based on the results of the analysis. With
this approach, our model matches the baseline performance, and improves the
efficiency by skipping the computation of those weights that standard attention
discards.
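As a rough illustration of the kind of local self-attention the abstract describes, here is a minimal sketch using a banded mask over standard scaled dot-product attention; the window size, tensor dimensions, and single-head formulation are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def local_self_attention(x, w_q, w_k, w_v, window: int):
    """Self-attention where each position attends only to neighbours
    within +/- `window` positions (a banded attention mask)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (batch, seq, seq)

    # Banded mask: True where |i - j| > window, i.e. positions to discard.
    idx = torch.arange(x.size(1), device=x.device)
    too_far = (idx[None, :] - idx[:, None]).abs() > window
    scores = scores.masked_fill(too_far, float("-inf"))

    return F.softmax(scores, dim=-1) @ v

# Illustrative usage with arbitrary sizes (not the paper's setup).
x = torch.randn(2, 100, 64)
w = [torch.randn(64, 64) * 0.05 for _ in range(3)]
out = local_self_attention(x, *w, window=16)
print(out.shape)  # torch.Size([2, 100, 64])
```

Note that masking alone only reproduces the restriction; the efficiency gain the abstract mentions comes from computing only the entries inside the band, e.g. with a sliding-window attention kernel.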
Related papers
- DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based
Point-Level Consistency [12.881617910150688]
We propose a transformer framework for self-supervised learning called DenseDINO to learn dense visual representations.
Specifically, DenseDINO introduces extra input tokens, called reference tokens, to match point-level features with a positional prior.
Compared with vanilla DINO, our approach obtains competitive performance on ImageNet classification.
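As a loose illustration of the reference-token idea (the module name, sizes, and the point-embedding choice below are assumptions, not details from the paper), extra tokens carrying a positional prior for sampled points can simply be concatenated with the patch tokens before the transformer encoder.

```python
import torch
import torch.nn as nn

class ReferenceTokenFrontend(nn.Module):
    """Prepends learnable 'reference tokens' tied to sampled point
    coordinates to the usual patch-token sequence (illustrative only)."""

    def __init__(self, d_model: int, num_refs: int):
        super().__init__()
        self.ref_tokens = nn.Parameter(torch.zeros(num_refs, d_model))
        self.point_embed = nn.Linear(2, d_model)  # (x, y) in [0, 1] -> d_model

    def forward(self, patch_tokens, points):
        # patch_tokens: (batch, n_patches, d_model); points: (batch, num_refs, 2)
        refs = self.ref_tokens.unsqueeze(0) + self.point_embed(points)
        return torch.cat([refs, patch_tokens], dim=1)

frontend = ReferenceTokenFrontend(d_model=192, num_refs=4)
tokens = frontend(torch.randn(2, 196, 192), torch.rand(2, 4, 2))
print(tokens.shape)  # torch.Size([2, 200, 192])
```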
arXiv Detail & Related papers (2023-06-06T15:04:45Z)
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
- Attention as a Guide for Simultaneous Speech Translation [15.860792612311277]
We propose an attention-based policy (EDAtt) for simultaneous speech translation (SimulST).
Its goal is to leverage the encoder-decoder attention scores to guide inference in real time.
Results on en->de and en->es show that the EDAtt policy achieves overall better results than the SimulST state of the art.
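A hedged sketch of how encoder-decoder attention scores could drive a wait/emit decision in SimulST follows; the threshold, the number of trailing frames, and the head aggregation are illustrative assumptions, not the exact EDAtt formulation.

```python
import torch

def should_emit(cross_attn: torch.Tensor, last_frames: int = 16,
                threshold: float = 0.3) -> bool:
    """Decide whether to emit the next target token or wait for more audio.

    cross_attn: (num_heads, src_len) attention of the candidate token over
    the encoder states received so far. If too much attention mass falls on
    the most recent frames, the prediction likely depends on audio that is
    still incomplete, so we wait.
    """
    attn = cross_attn.mean(dim=0)                 # average over heads
    recent_mass = attn[-last_frames:].sum().item()
    return recent_mass < threshold

# Toy usage: if attention is spread over earlier frames, it is safe to emit.
attn = torch.softmax(torch.randn(8, 120), dim=-1)
print(should_emit(attn))
```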
arXiv Detail & Related papers (2022-12-15T14:18:53Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
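One way to picture such a relative-location pretext task (the grid size, pooled-embedding inputs, and loss below are assumptions for illustration, not the paper's model): crop a query patch from one cell of a reference grid and train a classifier to predict which cell it came from.

```python
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    """Predicts which cell of a grid x grid reference layout a query patch
    embedding came from (toy pretext-task head)."""

    def __init__(self, d_model: int, grid: int = 7):
        super().__init__()
        self.classifier = nn.Linear(2 * d_model, grid * grid)

    def forward(self, query_emb, reference_emb):
        # query_emb, reference_emb: (batch, d_model) pooled embeddings.
        return self.classifier(torch.cat([query_emb, reference_emb], dim=-1))

head = RelativeLocationHead(d_model=256, grid=7)
logits = head(torch.randn(4, 256), torch.randn(4, 256))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 49, (4,)))
```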
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding [53.377028000325424]
We propose an Iterative Alignment Network (IA-Net) for the temporal sentence grounding task.
We pad multi-modal features with learnable parameters to alleviate the nowhere-to-attend problem of non-matched frame-word pairs.
We also devise a calibration module following each attention module to refine the alignment knowledge.
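The "nowhere-to-attend" trick is concrete enough to sketch: a learnable null slot is appended to the keys and values so that non-matching queries can park their attention mass there. Dimensions and placement here are illustrative, not IA-Net's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionWithNullToken(nn.Module):
    """Cross-attention with one extra learnable key/value slot that acts
    as a 'no match' sink for unaligned queries (illustrative)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.null_k = nn.Parameter(torch.zeros(1, 1, d_model))
        self.null_v = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, q, k, v):
        b = q.size(0)
        k = torch.cat([self.null_k.expand(b, -1, -1), k], dim=1)
        v = torch.cat([self.null_v.expand(b, -1, -1), v], dim=1)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v

attn = AttentionWithNullToken(d_model=128)
out = attn(torch.randn(2, 20, 128), torch.randn(2, 30, 128), torch.randn(2, 30, 128))
print(out.shape)  # torch.Size([2, 20, 128])
```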
arXiv Detail & Related papers (2021-09-14T02:08:23Z)
- Bayesian Attention Belief Networks [59.183311769616466]
Attention-based neural networks have achieved state-of-the-art results on a wide range of tasks.
This paper introduces Bayesian attention belief networks, which construct a decoder network by modeling unnormalized attention weights.
We show that our method outperforms deterministic attention and state-of-the-art attention variants in accuracy, uncertainty estimation, generalization across domains, and robustness to adversarial attacks.
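Very roughly, the idea is to treat attention weights as random variables rather than point estimates. The sketch below adds reparameterized noise to the attention logits, which is only a generic stochastic-attention illustration, not the paper's hierarchical Bayesian construction.

```python
import torch
import torch.nn.functional as F

def stochastic_attention(q, k, v, log_sigma: float = -1.0):
    """Attention where the weights are sampled around the usual dot-product
    logits (one reparameterized Monte Carlo sample, illustrative only)."""
    logits = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    noise = torch.randn_like(logits) * torch.exp(torch.tensor(log_sigma))
    weights = F.softmax(logits + noise, dim=-1)
    return weights @ v

# Averaging several samples at test time gives a crude uncertainty estimate.
q, k, v = (torch.randn(2, 10, 64) for _ in range(3))
samples = torch.stack([stochastic_attention(q, k, v) for _ in range(8)])
mean, std = samples.mean(dim=0), samples.std(dim=0)
```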
arXiv Detail & Related papers (2021-06-09T17:46:22Z)
- Capturing Multi-Resolution Context by Dilated Self-Attention [58.69803243323346]
We propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention.
The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution.
ASR results demonstrate substantial improvements over restricted self-attention alone, achieving results similar to full-sequence self-attention at a fraction of the computational cost.
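A hedged sketch of combining restricted attention with a coarse summary of distant context: nearby frames are attended at full resolution while the rest of the sequence is mean-pooled into blocks. Window and block sizes are illustrative, projections are omitted, and the paper's exact summarization mechanism may differ.

```python
import torch
import torch.nn.functional as F

def dilated_self_attention(x, window: int = 8, block: int = 16):
    """Each frame attends to its +/- `window` neighbours at full resolution
    plus mean-pooled summaries of `block`-sized chunks of the sequence."""
    b, t, d = x.shape

    # Coarse summaries of the whole sequence (distant context, low resolution).
    pad = (block - t % block) % block
    summaries = F.pad(x, (0, 0, 0, pad)).view(b, -1, block, d).mean(dim=2)

    q = x
    k = torch.cat([x, summaries], dim=1)
    v = k
    scores = q @ k.transpose(-2, -1) / d ** 0.5

    # Restrict frame-to-frame scores to a local band; summary keys stay visible.
    idx = torch.arange(t, device=x.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window
    mask = torch.cat([band, torch.ones(t, summaries.size(1), dtype=torch.bool,
                                       device=x.device)], dim=1)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

out = dilated_self_attention(torch.randn(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 64])
```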
arXiv Detail & Related papers (2021-04-07T02:04:18Z)
- Improving BERT with Syntax-aware Local Attention [14.70545694771721]
We propose a syntax-aware local attention, where the attention scopes are based on the distances in the syntactic structure.
We conduct experiments on various single-sentence benchmarks, including sentence classification and sequence labeling tasks.
Our model achieves better performance owing to more focused attention over syntactically relevant words.
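A sketch of the masking idea, assuming a dependency parse is available as a head array: attention is restricted to tokens within a chosen distance in the dependency tree. The distance computation and threshold are illustrative; the paper's exact scope definition may differ.

```python
import numpy as np
import torch
import torch.nn.functional as F

def tree_distances(heads):
    """All-pairs distances in a dependency tree given head indices
    (heads[i] = index of token i's head, -1 for the root)."""
    n = len(heads)
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0)
    for i, h in enumerate(heads):
        if h >= 0:
            dist[i, h] = dist[h, i] = 1
    for k in range(n):                    # Floyd-Warshall over sentence length
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    return dist

def syntax_local_attention(q, k, v, heads, max_dist: int = 2):
    """Attend only to tokens within `max_dist` hops in the dependency tree."""
    allowed = torch.from_numpy(tree_distances(heads) <= max_dist)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# "The cat sat": The -> cat, cat -> sat, sat is the root.
heads = [1, 2, -1]
q = k = v = torch.randn(3, 16)
out = syntax_local_attention(q, k, v, heads)
```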
arXiv Detail & Related papers (2020-12-30T13:29:58Z)
- Learning Hard Retrieval Decoder Attention for Transformers [69.40942736249397]
The Transformer translation model is based on the multi-head attention mechanism, which can be easily parallelized.
We show that our hard retrieval attention mechanism is 1.43 times faster in decoding.
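The hard-retrieval idea is simple enough to sketch: instead of a softmax-weighted sum over values, the decoder picks the single value with the highest attention score. This is an inference-time view only; training such a non-differentiable choice needs extra machinery (e.g. straight-through estimation or sampling), which is not shown.

```python
import torch

def hard_retrieval_attention(q, k, v):
    """Cross-attention that retrieves exactly one value per query:
    the one with the highest dot-product score (inference-time sketch)."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (tgt, src)
    best = scores.argmax(dim=-1)                           # index of top key
    return v[best]                                         # gather, no weighted sum

q = torch.randn(5, 64)    # decoder queries
k = torch.randn(12, 64)   # encoder keys
v = torch.randn(12, 64)   # encoder values
print(hard_retrieval_attention(q, k, v).shape)  # torch.Size([5, 64])
```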
arXiv Detail & Related papers (2020-09-30T13:18:57Z)