Sample, Translate, Recombine: Leveraging Audio Alignments for Data Augmentation in End-to-end Speech Translation
- URL: http://arxiv.org/abs/2203.08757v1
- Date: Wed, 16 Mar 2022 17:15:46 GMT
- Authors: Tsz Kin Lam, Shigehiko Schamoni, Stefan Riezler
- Abstract summary: We present a novel approach to data augmentation that leverages audio alignments, linguistic properties, and translation.
Our method delivers consistent improvements of up to 0.9 and 1.1 BLEU points on five language pairs on CoVoST 2 and on two language pairs on Europarl-ST, respectively.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end speech translation relies on data that pair source-language speech
inputs with corresponding translations into a target language. Such data are
notoriously scarce, making synthetic data augmentation by back-translation or
knowledge distillation a necessary ingredient of end-to-end training. In this
paper, we present a novel approach to data augmentation that leverages audio
alignments, linguistic properties, and translation. First, we augment a
transcription by sampling from a suffix memory that stores text and audio data.
Second, we translate the augmented transcript. Finally, we recombine
concatenated audio segments and the generated translation. Besides training an
MT-system, we only use basic off-the-shelf components without fine-tuning.
While having similar resource demands as knowledge distillation, adding our
method delivers consistent improvements of up to 0.9 and 1.1 BLEU points on
five language pairs on CoVoST 2 and on two language pairs on Europarl-ST,
respectively.
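The three-step augmentation described in the abstract can be sketched as follows. This is a minimal, illustrative Python sketch of the sample-translate-recombine loop; the function names, data structures, and the `suffix_memory` representation are assumptions for exposition, not the authors' actual implementation.

```python
import random

def augment(example, suffix_memory, translate_mt):
    """Create one synthetic speech-translation pair from an existing one.

    example: dict with 'audio' (list of aligned audio segments) and
             'transcript' (source-language text)
    suffix_memory: list of (text_suffix, audio_segments) pairs, built
                   from audio alignments of the training data
    translate_mt: callable str -> str, an off-the-shelf MT system
    """
    # 1) Sample: draw a suffix from the memory and append its text
    #    to the original transcription.
    suffix_text, suffix_audio = random.choice(suffix_memory)
    new_transcript = example["transcript"] + " " + suffix_text

    # 2) Translate: run the augmented transcript through the MT system
    #    to obtain a target-language translation.
    new_translation = translate_mt(new_transcript)

    # 3) Recombine: concatenate the audio segments so the synthetic
    #    audio matches the augmented transcript.
    new_audio = example["audio"] + suffix_audio
    return {"audio": new_audio, "translation": new_translation}
```

In practice the suffix memory would be populated using the audio alignments and linguistic constraints from the paper; the sketch only shows the control flow of the three steps.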
Related papers
- A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation [38.88908101517807]
Our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks.
Each dataset pair is precisely matched for paralinguistic information and duration.
We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details.
arXiv Detail & Related papers (2025-02-01T09:24:32Z)
- Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text.
We incorporate additional contextual cues together with the signing video.
We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z)
- End-to-End Speech Translation of Arabic to English Broadcast News [2.375764121997739]
Speech translation (ST) is the task of translating acoustic speech signals in a source language into text in a foreign language.
This paper presents our efforts towards the development of the first Broadcast News end-to-end Arabic to English speech translation system.
arXiv Detail & Related papers (2022-12-11T11:35:46Z)
- Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data [38.816953592085156]
We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems.
It enhances the model's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text).
arXiv Detail & Related papers (2022-12-04T09:27:56Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.