Improving End-to-End Speech Translation by Imitation-Based Knowledge
Distillation with Synthetic Transcripts
- URL: http://arxiv.org/abs/2307.08426v1
- Date: Mon, 17 Jul 2023 12:14:45 GMT
- Title: Improving End-to-End Speech Translation by Imitation-Based Knowledge
Distillation with Synthetic Transcripts
- Authors: Rebekka Hubert and Artem Sokolov and Stefan Riezler
- Abstract summary: We present an imitation learning approach where a teacher NMT system corrects the errors of an AST student without relying on manual transcripts.
We show that the NMT teacher can recover from errors in automatic transcriptions and is able to correct erroneous translations of the AST student.
- Score: 12.097786953347828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end automatic speech translation (AST) relies on data that combines
audio inputs with text translation outputs. Previous work used existing large
parallel corpora of transcriptions and translations in a knowledge distillation
(KD) setup to distill a neural machine translation (NMT) into an AST student
model. While KD allows using larger pretrained models, the reliance of previous
KD approaches on manual audio transcripts in the data pipeline restricts the
applicability of this framework to AST. We present an imitation learning
approach where a teacher NMT system corrects the errors of an AST student
without relying on manual transcripts. We show that the NMT teacher can recover
from errors in automatic transcriptions and is able to correct erroneous
translations of the AST student, leading to improvements of about 4 BLEU points
over the standard AST end-to-end baseline on the English-German CoVoST-2 and
MuST-C datasets, respectively. Code and data are publicly available at
https://github.com/HubReb/imitkd_ast/releases/tag/v1.1
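A minimal sketch of the imitation-based KD loop described in the abstract may help: the teacher NMT system acts as an expert that supplies corrective next tokens given a synthetic (ASR-produced) transcript and the student's own translation prefix. All callable names below (asr_transcribe, teacher_next_token, student_next_token, update_student) are hypothetical placeholders, not the authors' released code, and roll-in/aggregation details are simplified.
```python
# Hedged sketch of imitation-based knowledge distillation for AST.
# Placeholder interfaces only; not a specific library API.

from typing import Callable, List


def imitation_kd_step(
    audio_features: List[float],
    asr_transcribe: Callable[[List[float]], str],                 # placeholder ASR system
    teacher_next_token: Callable[[str, List[str]], str],          # placeholder NMT teacher (expert)
    student_next_token: Callable[[List[float], List[str]], str],  # placeholder AST student
    update_student: Callable[[List[float], List[str], str], None],  # placeholder optimizer step
    max_len: int = 64,
) -> List[str]:
    """One training episode: roll in with the student, query the teacher as expert."""
    # No manual transcript is needed; the source text is produced by ASR.
    synthetic_transcript = asr_transcribe(audio_features)
    prefix: List[str] = []
    for _ in range(max_len):
        # Expert action: teacher's corrective next token given the synthetic
        # transcript and the student's (possibly erroneous) prefix.
        expert_token = teacher_next_token(synthetic_transcript, prefix)
        # Train the student to imitate the expert action in this state.
        update_student(audio_features, prefix, expert_token)
        # Roll-in with the student's own prediction so it learns to recover
        # from its own mistakes (imitation learning rather than teacher forcing).
        next_token = student_next_token(audio_features, prefix)
        prefix.append(next_token)
        if next_token == "</s>":  # assumed end-of-sequence marker
            break
    return prefix
```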
Related papers
- Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z)
- Back Translation for Speech-to-text Translation Without Transcripts [11.13240570688547]
We develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data.
To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units.
With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets.
arXiv Detail & Related papers (2023-05-15T15:12:40Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- The NiuTrans End-to-End Speech Translation System for IWSLT 2021 Offline Task [23.008938777422767]
This paper describes the submission of the NiuTrans end-to-end speech translation system for the IWSLT 2021 offline task.
We use the Transformer-based model architecture and enhance it with Conformer, relative position encoding, and stacked acoustic and textual encoding.
We achieve 33.84 BLEU points on the MuST-C En-De test set, which shows the enormous potential of the end-to-end model.
arXiv Detail & Related papers (2021-07-06T07:45:23Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation [88.78138830698173]
We focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models.
We train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder.
arXiv Detail & Related papers (2021-04-13T19:00:51Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
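For contrast with the imitation-based correction loop above, several of the related papers build on standard sequence-level knowledge distillation, where a teacher NMT model translates (possibly synthetic) transcripts offline to create distillation targets. The following is a minimal, hypothetical sketch of that data-construction step; asr_transcribe and nmt_translate are placeholder callables, not a specific library's API.
```python
# Hedged sketch of sequence-level KD data construction for AST (forward SeqKD).
# Placeholder interfaces only; not the pipeline of any specific paper above.

from typing import Callable, List, Tuple


def build_seqkd_dataset(
    audio_corpus: List[List[float]],
    asr_transcribe: Callable[[List[float]], str],  # placeholder ASR system
    nmt_translate: Callable[[str], str],           # placeholder teacher NMT
) -> List[Tuple[List[float], str]]:
    """Pair each audio clip with the teacher's translation of its transcript.

    The AST student is then trained on (audio, teacher_translation) pairs
    instead of (audio, reference_translation) pairs.
    """
    distilled: List[Tuple[List[float], str]] = []
    for audio in audio_corpus:
        transcript = asr_transcribe(audio)           # synthetic source-language text
        teacher_target = nmt_translate(transcript)   # static distillation target
        distilled.append((audio, teacher_target))
    return distilled
```
Unlike the imitation-learning sketch earlier, this pipeline fixes the distillation targets before training, so the teacher never sees or corrects the student's own errors.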
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.