ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks
- URL: http://arxiv.org/abs/2205.01987v1
- Date: Wed, 4 May 2022 10:36:57 GMT
- Title: ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks
- Authors: Marcely Zanon Boito, John Ortega, Hugo Riguidel, Antoine Laurent, Loïc Barrault, Fethi Bougares, Firas Chaabani, Ha Nguyen, Florentin Barbier, Souhir Gahbiche, Yannick Estève
- Abstract summary: This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation.
We build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR.
Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models.
- Score: 8.651248939672769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes the ON-TRAC Consortium translation systems developed for
two challenge tracks featured in the Evaluation Campaign of IWSLT 2022:
low-resource and dialect speech translation. For the Tunisian Arabic-English
dataset (low-resource and dialect tracks), we build an end-to-end model as our
joint primary submission, and compare it against cascaded models that leverage
a large fine-tuned wav2vec 2.0 model for ASR. Our results show that in our
settings pipeline approaches are still very competitive, and that with the use
of transfer learning, they can outperform end-to-end models for speech
translation (ST). For the Tamasheq-French dataset (low-resource track) our
primary submission leverages intermediate representations from a wav2vec 2.0
model trained on 234 hours of Tamasheq audio, while our contrastive model uses
a French phonetic transcription of the Tamasheq audio as input in a Conformer
speech translation architecture jointly trained on automatic speech
recognition, ST and machine translation losses. Our results highlight that
self-supervised models trained on smaller sets of target data are more
effective for low-resource end-to-end ST fine-tuning than large
off-the-shelf models. Results also illustrate that even approximate phonetic
transcriptions can improve ST scores.
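As an illustration of the cascaded setup described in the abstract, the sketch below wires a fine-tuned wav2vec 2.0 CTC model for ASR to a separate MT model using HuggingFace Transformers. This is a minimal sketch, not the authors' pipeline; both checkpoint names are placeholders:

    import torch
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              Wav2Vec2ForCTC, Wav2Vec2Processor)

    ASR_CKPT = "your-org/wav2vec2-large-tunisian-asr"  # placeholder, not a released checkpoint
    MT_CKPT = "your-org/arabic-english-mt"             # placeholder, not a released checkpoint

    asr_processor = Wav2Vec2Processor.from_pretrained(ASR_CKPT)
    asr_model = Wav2Vec2ForCTC.from_pretrained(ASR_CKPT)
    mt_tokenizer = AutoTokenizer.from_pretrained(MT_CKPT)
    mt_model = AutoModelForSeq2SeqLM.from_pretrained(MT_CKPT)

    def cascade_translate(waveform, sampling_rate=16000):
        """Speech -> transcript (greedy CTC decoding) -> translation."""
        inputs = asr_processor(waveform, sampling_rate=sampling_rate,
                               return_tensors="pt")
        with torch.no_grad():
            logits = asr_model(inputs.input_values).logits
        pred_ids = torch.argmax(logits, dim=-1)
        transcript = asr_processor.batch_decode(pred_ids)[0]
        mt_inputs = mt_tokenizer(transcript, return_tensors="pt")
        with torch.no_grad():
            generated = mt_model.generate(**mt_inputs, max_new_tokens=128)
        return mt_tokenizer.decode(generated[0], skip_special_tokens=True)

Because the two stages are decoupled, the ASR and MT models can be improved (or transfer-learned) independently, which is one reason such pipelines remain competitive in low-resource settings.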
Related papers
- Coupling Speech Encoders with Downstream Text Models [4.679869237248675]
We present a modular approach to building cascade speech translation models.
We preserve state-of-the-art speech recognition (ASR) and text translation (MT) performance for a given task.
arXiv Detail & Related papers (2024-07-24T19:29:13Z)
- NAIST Simultaneous Speech Translation System for IWSLT 2024 [18.77311658086372]
This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign.
We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained models, HuBERT and mBART.
We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt.
Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module.
arXiv Detail & Related papers (2024-06-30T20:41:02Z)
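The Local Agreement (LA) policy named in the NAIST entry above reduces to a few lines: re-translate the growing audio prefix after each incoming chunk and commit only the tokens on which consecutive hypotheses agree. A minimal sketch follows; translate is a hypothetical stand-in for any offline speech-to-text model:

    from typing import Callable, List

    def common_prefix(a: List[str], b: List[str]) -> List[str]:
        out = []
        for x, y in zip(a, b):
            if x != y:
                break
            out.append(x)
        return out

    def local_agreement(chunks, translate: Callable[[list], List[str]]):
        """Yield tokens newly committed after each incoming audio chunk."""
        audio, committed, prev_hyp = [], [], None
        for chunk in chunks:
            audio.extend(chunk)
            hyp = translate(audio)             # full re-decoding of the prefix
            if prev_hyp is not None:
                stable = common_prefix(prev_hyp, hyp)
                new = stable[len(committed):]  # agreed tokens not yet emitted
                if new:
                    committed.extend(new)
                    yield new
            prev_hyp = hyp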
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
We were declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
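One plausible realization of the audio-textual representation in the entry above (a hedged sketch, not the paper's exact cascaded architecture) is to concatenate a pooled speech-encoder embedding with a pooled text-encoder embedding of the ASR transcript:

    import torch
    from transformers import (AutoModel, AutoTokenizer,
                              Wav2Vec2FeatureExtractor, Wav2Vec2Model)

    audio_fe = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    audio_enc = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
    text_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    text_enc = AutoModel.from_pretrained("bert-base-uncased")

    def multimodal_embedding(waveform, transcript, sampling_rate=16000):
        """Mean-pool audio and text encoder states, then concatenate them."""
        a = audio_fe(waveform, sampling_rate=sampling_rate, return_tensors="pt")
        t = text_tok(transcript, return_tensors="pt")
        with torch.no_grad():
            audio_vec = audio_enc(a.input_values).last_hidden_state.mean(dim=1)
            text_vec = text_enc(**t).last_hidden_state.mean(dim=1)
        return torch.cat([audio_vec, text_vec], dim=-1)  # input to a classifier head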
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST).
We conducted experiments in both simulated and real low-resource setups, on the English-Portuguese and Tamasheq-French language pairs, respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z)
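One strategy the title above suggests is to initialize an ST encoder from a pre-trained ASR encoder before fine-tuning on translation data. The sketch below assumes both models expose an encoder submodule with an identical architecture; st_model and asr_model are hypothetical objects, not the paper's code:

    def init_st_encoder_from_asr(st_model, asr_model):
        """Copy pre-trained ASR encoder weights into the ST model's encoder.

        Hypothetical: adapt the attribute names to your toolkit's models.
        """
        st_model.encoder.load_state_dict(asr_model.encoder.state_dict())
        return st_model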
- Back Translation for Speech-to-text Translation Without Transcripts [11.13240570688547]
We develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data.
To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units.
With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets.
arXiv Detail & Related papers (2023-05-15T15:12:40Z)
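The back-translation data flow in the BT4ST entry above can be sketched as follows; text_to_units (the backward model) and units_to_waveform (a unit-based vocoder) are hypothetical stand-ins, not the paper's released components:

    def synthesize_pseudo_st(target_sentences, text_to_units, units_to_waveform):
        """Turn monolingual target text into pseudo (source speech, target text) pairs."""
        pseudo_data = []
        for text in target_sentences:
            units = text_to_units(text)        # backward model: target text -> discrete source units
            speech = units_to_waveform(units)  # vocoder: discrete units -> waveform
            pseudo_data.append({"source_speech": speech, "target_text": text})
        return pseudo_data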
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- The USYD-JD Speech Translation System for IWSLT 2021 [85.64797317290349]
This paper describes the University of Sydney and JD's joint submission to the IWSLT 2021 low-resource speech translation task.
We trained our models with the officially provided ASR and MT datasets.
To achieve better translation performance, we explored the most recent effective strategies, including back translation, knowledge distillation, multi-feature reranking and transductive finetuning.
arXiv Detail & Related papers (2021-07-24T09:53:34Z)
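Multi-feature reranking, as mentioned in the USYD-JD entry above, reduces to scoring each n-best hypothesis with a weighted sum of feature functions and keeping the best one. A minimal sketch; the feature functions and weights below are illustrative, not the paper's:

    def rerank(hypotheses, features, weights):
        """Return the hypothesis maximizing a weighted sum of feature scores."""
        def score(hyp):
            return sum(w * f(hyp) for w, f in zip(weights, features))
        return max(hypotheses, key=score)

    # Example with two illustrative features: model log-probability and a
    # soft penalty on deviating from an expected length of 20 tokens.
    # best = rerank(nbest,
    #               features=[model_logprob, lambda h: -abs(len(h.split()) - 20)],
    #               weights=[1.0, 0.1])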