Attention as a Guide for Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2212.07850v2
- Date: Thu, 11 May 2023 10:15:18 GMT
- Title: Attention as a Guide for Simultaneous Speech Translation
- Authors: Sara Papi, Matteo Negri, Marco Turchi
- Abstract summary: We propose an attention-based policy (EDAtt) for simultaneous speech translation (SimulST)
Its goal is to leverage the encoder-decoder attention scores to guide inference in real time.
- Abstract summary: Results on en->{de, es} show that the EDAtt policy achieves overall better results than the SimulST state of the art.
- Score: 15.860792612311277
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The study of the attention mechanism has sparked interest in many fields,
such as language modeling and machine translation. Although its patterns have
been exploited to perform different tasks, from neural network understanding to
textual alignment, no previous work has analysed the encoder-decoder attention
behavior in speech translation (ST) nor used it to improve ST on a specific
task. In this paper, we fill this gap by proposing an attention-based policy
(EDAtt) for simultaneous ST (SimulST) that is motivated by an analysis of the
existing attention relations between audio input and textual output. Its goal
is to leverage the encoder-decoder attention scores to guide inference in real
time. Results on en->{de, es} show that the EDAtt policy achieves overall
better results compared to the SimulST state of the art, especially in terms of
computational-aware latency.
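Concretely, the policy reduces to a threshold test on where each candidate token's encoder-decoder attention mass falls: if the token attends mostly to the newest, still-incomplete audio, the system waits for more speech; otherwise it emits. The following is a minimal Python sketch of this idea, assuming attention vectors normalized over the frames received so far; the names edatt_emit, alpha, and last_frames are illustrative stand-ins for the paper's hyperparameters, not its implementation.

```python
import numpy as np

def edatt_emit(attn_weights: np.ndarray, alpha: float = 0.1,
               last_frames: int = 2) -> bool:
    """Sketch of an EDAtt-style decision: emit the candidate token only
    if it does not attend too strongly to the most recent audio.

    attn_weights: encoder-decoder attention of the candidate token over
        the audio frames received so far, shape (num_frames,), assumed
        to sum to 1 (hypothetical interface, for illustration only).
    alpha: emission threshold on the attention mass of the final frames.
    last_frames: number of trailing frames treated as unstable input.
    """
    # Attention mass placed on the newest, possibly incomplete audio.
    tail_mass = attn_weights[-last_frames:].sum()
    # Low tail mass means the token is grounded in audio that is already
    # complete, so it is safe to emit; otherwise, wait for more speech.
    return tail_mass < alpha
```

In a simultaneous decoding loop, this test would run once per candidate token, with the system reading more audio whenever it returns False; a higher alpha emits more eagerly, trading translation quality for lower latency.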
Related papers
- STAB: Speech Tokenizer Assessment Benchmark [57.45234921100835]
Representing speech as discrete tokens provides a framework for transforming speech into a format that closely resembles text.
We present STAB (Speech Tokenizer Assessment Benchmark), a systematic evaluation framework designed to assess speech tokenizers comprehensively.
We evaluate the STAB metrics and correlate them with downstream task performance across a range of speech tasks and tokenizer choices.
arXiv Detail & Related papers (2024-09-04T02:20:59Z)
- Rethinking and Improving Multi-task Learning for End-to-end Speech Translation [51.713683037303035]
We investigate the consistency between different tasks, considering different training stages and modules.
We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations.
We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.
arXiv Detail & Related papers (2023-11-07T08:48:46Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts as visual and language scene graphs (SGs), whose fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by significant BLEU margins on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation [15.860792612311277]
We propose a novel policy for simultaneous speech translation (SimulST) that exploits the attention information to generate source-target alignments.
We show that AlignAtt outperforms previous state-of-the-art SimulST policies applied to offline-trained models, with BLEU gains of 2 points and latency reductions ranging from 0.5s to 0.8s across the 8 languages (a minimal sketch of this alignment-based stopping rule follows this list).
arXiv Detail & Related papers (2023-05-19T03:31:42Z)
- On the Locality of Attention in Direct Speech Translation [0.1749935196721634]
Transformers have achieved state-of-the-art results across multiple NLP tasks.
We discuss the usefulness of self-attention for Direct Speech Translation.
arXiv Detail & Related papers (2022-04-19T17:43:37Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Visualization: the missing factor in Simultaneous Speech Translation [14.454116027072335]
Simultaneous speech translation (SimulST) is a task in which output generation has to be performed on partial, incremental speech input.
SimulST has become popular due to the spread of cross-lingual application scenarios.
arXiv Detail & Related papers (2021-10-31T14:44:01Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
Instead of intermediate text, we propose to predict self-supervised discrete representations learned from an unlabeled speech corpus.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Improving BERT with Syntax-aware Local Attention [14.70545694771721]
We propose a syntax-aware local attention, where the attention scopes are based on the distances in the syntactic structure.
We conduct experiments on various single-sentence benchmarks, including sentence classification and sequence labeling tasks.
Our model achieves better performance owing to more focused attention over syntactically relevant words.
arXiv Detail & Related papers (2020-12-30T13:29:58Z)
- Salience Estimation with Multi-Attention Learning for Abstractive Text Summarization [86.45110800123216]
In the task of text summarization, salience estimation for words, phrases or sentences is a critical component.
We propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation.
arXiv Detail & Related papers (2020-04-07T02:38:56Z)
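As anticipated in the AlignAtt entry above, that alignment-based stopping rule can be sketched as an argmax test over the same encoder-decoder attention used by EDAtt. This Python sketch is illustrative only; the name alignatt_emit and the frame window f are assumptions standing in for the paper's actual formulation.

```python
import numpy as np

def alignatt_emit(attn_weights: np.ndarray, f: int = 2) -> bool:
    """Sketch of an AlignAtt-style decision: emit the candidate token
    only if its attention-based alignment does not fall within the
    last f audio frames (hypothetical interface, for illustration).
    """
    # Hard source-target alignment: the frame the token attends to most.
    aligned_frame = int(attn_weights.argmax())
    # A token aligned to the newest frames depends on audio that may
    # still change as more speech arrives, so the system should wait.
    return aligned_frame < attn_weights.shape[0] - f
```

Compared with the EDAtt threshold sketch above, this rule replaces the soft attention-mass test with a hard argmax alignment, which removes the need to tune the alpha threshold.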