AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation
- URL: http://arxiv.org/abs/2305.11408v2
- Date: Thu, 20 Jul 2023 00:58:30 GMT
- Title: AlignAtt: Using Attention-based Audio-Translation Alignments as a Guide for Simultaneous Speech Translation
- Authors: Sara Papi, Marco Turchi, Matteo Negri
- Abstract summary: We propose a novel policy for simultaneous speech translation (SimulST) that exploits the attention information to generate source-target alignments.
We show that AlignAtt outperforms previous state-of-the-art SimulST policies applied to offline-trained models, with gains of 2 BLEU points and latency reductions ranging from 0.5s to 0.8s across the 8 languages.
- Score: 15.860792612311277
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Attention is the core mechanism of today's most widely used architectures for natural language processing and has been analyzed from many perspectives, including its effectiveness for machine translation-related tasks. Among these studies, attention has proved to be a useful source of information for gaining insights into word alignment, even when the input text is replaced with audio segments, as in the speech translation (ST) task. In this
paper, we propose AlignAtt, a novel policy for simultaneous ST (SimulST) that
exploits the attention information to generate source-target alignments that
guide the model during inference. Through experiments on the 8 language pairs
of MuST-C v1.0, we show that AlignAtt outperforms previous state-of-the-art
SimulST policies applied to offline-trained models, with gains of 2 BLEU points and latency reductions ranging from 0.5s to 0.8s across the 8 languages.
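The emission rule lends itself to a compact sketch. The following is a minimal, illustrative Python rendering of the idea stated above, assuming a frame threshold `f` and a cross-attention matrix already averaged over layers and heads (the function name and tensor shapes are our conventions, not the authors' code):

```python
import torch

def alignatt_n_emit(cross_attn: torch.Tensor, f: int) -> int:
    """Illustrative sketch of an AlignAtt-style decision rule.

    cross_attn: (tgt_len, src_len) encoder-decoder attention over the audio
        frames received so far, for the tokens of the current hypothesis.
    f: number of most recent frames treated as still unstable.

    Returns how many candidate tokens are safe to emit now.
    """
    src_len = cross_attn.size(1)
    n_emit = 0
    for t in range(cross_attn.size(0)):
        aligned_frame = int(cross_attn[t].argmax())  # attention-based alignment
        if aligned_frame >= src_len - f:
            break  # token aligned to audio that may still change: wait
        n_emit += 1
    return n_emit
```

Intuitively, larger values of `f` make the policy more conservative (higher latency), while smaller values emit more eagerly.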
Related papers
- Do Audio-Language Models Understand Linguistic Variations? [42.17718387132912]
Open-vocabulary audio language models (ALMs) represent a promising new paradigm for audio-text retrieval using natural language queries.
We propose RobustCLAP, a novel and compute-efficient technique to learn audio-language representations that are robust to linguistic variations.
arXiv Detail & Related papers (2024-10-21T20:55:33Z)
- Prosody in Cascade and Direct Speech-to-Text Translation: a case study on Korean Wh-Phrases [79.07111754406841]
This work proposes using contrastive evaluation to measure the ability of direct S2TT systems to disambiguate utterances where prosody plays a crucial role.
Our results clearly demonstrate the value of direct translation systems over cascade translation models.
arXiv Detail & Related papers (2024-02-01T14:46:35Z)
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech-to-text translation (ST).
We conducted experiments on both simulated and real low-resource setups, on the language pairs English-Portuguese and Tamasheq-French, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
- Attention as a Guide for Simultaneous Speech Translation [15.860792612311277]
We propose an attention-based policy (EDAtt) for simultaneous speech translation (SimulST).
Its goal is to leverage the encoder-decoder attention scores to guide inference in real time.
Results on en->de and en->es show that the EDAtt policy achieves overall better results than the SimulST state of the art.
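A hedged sketch of this kind of attention-guided test for a single candidate token (the threshold `alpha` and the trailing-frame count `last_frames` are illustrative assumptions, not the authors' API):

```python
import torch

def edatt_should_emit(attn_row: torch.Tensor, alpha: float, last_frames: int) -> bool:
    """Sketch of an EDAtt-style emission test (illustrative, not the
    authors' code). attn_row: (src_len,) encoder-decoder attention of the
    candidate token over the audio frames received so far."""
    # Emit only if little attention mass falls on the newest, still-unstable
    # frames; otherwise wait for more audio before committing to the token.
    return attn_row[-last_frames:].sum().item() < alpha
```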
arXiv Detail & Related papers (2022-12-15T14:18:53Z)
- Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation [49.916963624249355]
A UNMT model is trained on pseudo-parallel data whose source side is machine-translated, yet it translates natural source sentences at inference time.
This source discrepancy between training and inference hinders the translation performance of UNMT models.
We propose an online self-training approach that additionally uses pseudo-parallel pairs of the form {natural source, translated target} to mimic the inference scenario.
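A minimal sketch of one such online self-training step, assuming hypothetical `back_translate` and `translate` helpers (all names here are illustrative, not the paper's code):

```python
def online_self_training_batch(model, mono_src, mono_tgt, back_translate, translate):
    """Sketch of an online self-training step for UNMT (illustrative names).
    Mixes standard back-translated pairs with pairs that mimic inference."""
    # Standard back-translation: translated source -> natural target (S', T).
    pseudo_src = back_translate(model, mono_tgt)
    bt_pairs = list(zip(pseudo_src, mono_tgt))

    # Online self-training: natural source -> translated target (S, T'),
    # matching what the model actually sees at inference time.
    pseudo_tgt = translate(model, mono_src)
    st_pairs = list(zip(mono_src, pseudo_tgt))

    return bt_pairs + st_pairs  # train on both to shrink the source gap
```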
arXiv Detail & Related papers (2022-03-16T04:50:27Z)
- ADIMA: Abuse Detection In Multilingual Audio [28.64185949388967]
Abusive content detection in spoken text can be addressed by performing Automatic Speech Recognition (ASR) and leveraging advancements in natural language processing.
We propose ADIMA, a novel, linguistically diverse, ethically sourced, expert-annotated and well-balanced multilingual profanity detection audio dataset.
arXiv Detail & Related papers (2022-02-16T11:09:50Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
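Conceptually, the resulting pipeline fits in a few lines; the following sketch uses illustrative argument names, not the paper's API:

```python
def direct_s2st_sketch(src_waveform, s2u_model, unit_vocoder):
    """Conceptual sketch of discrete-unit S2ST (illustrative, not the
    paper's code). No intermediate text is generated along the way."""
    # Speech-to-unit translation: predict a sequence of self-supervised
    # discrete units (e.g., k-means cluster IDs of HuBERT features)
    # learned from unlabeled target-language speech.
    units = s2u_model(src_waveform)
    # A separately trained unit-based vocoder synthesizes the target waveform.
    return unit_vocoder(units)
```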
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Attention-based Contextual Language Model Adaptation for Speech Recognition [13.516224963932858]
We introduce an attention mechanism for training neural speech recognition language models on text and non-linguistic contextual data.
Our method reduces perplexity by 7.0% relative over a standard LM that does not incorporate contextual information.
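A generic rendering of such context attention, assuming the non-linguistic signals are already embedded (module choice and shapes are our assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class ContextAttentionFusion(nn.Module):
    """Sketch: attend over contextual embeddings from the LM state
    (a generic formulation, not the paper's exact model)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    def forward(self, lm_state: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # lm_state: (B, 1, d) current hidden state, used as the query;
        # ctx: (B, n_ctx, d) embedded contextual signals (keys and values).
        summary, _ = self.attn(lm_state, ctx, ctx)
        return lm_state + summary  # context-conditioned state for prediction
```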
arXiv Detail & Related papers (2021-06-02T20:19:57Z)
- Learning Shared Semantic Space for Speech-to-Text Translation [32.12445734213848]
We propose to bridge the modality gap between text machine translation (MT) and end-to-end speech translation (ST).
By projecting audio and text features to a common semantic representation, Chimera unifies MT and ST tasks.
Specifically, Chimera obtains 26.3 BLEU on EN-DE, improving the SOTA by a +2.7 BLEU margin.
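A minimal sketch of the shared-projection idea, assuming modality-specific encoders have already produced features (a generic rendering, not Chimera's exact architecture):

```python
import torch
import torch.nn as nn

class SharedSemanticProjection(nn.Module):
    """Sketch: project audio and text features into one semantic space so
    that MT and ST can share a decoder (illustrative, not Chimera's code)."""
    def __init__(self, d_audio: int, d_text: int, d_sem: int):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_sem)
        self.text_proj = nn.Linear(d_text, d_sem)

    def forward(self, feats: torch.Tensor, modality: str) -> torch.Tensor:
        proj = self.audio_proj if modality == "audio" else self.text_proj
        return proj(feats)  # both modalities land in the same space
```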
arXiv Detail & Related papers (2021-05-07T07:49:56Z)
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word order information in natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.