Tackling data scarcity in speech translation using zero-shot
multilingual machine translation techniques
- URL: http://arxiv.org/abs/2201.11172v1
- Date: Wed, 26 Jan 2022 20:20:59 GMT
- Title: Tackling data scarcity in speech translation using zero-shot
multilingual machine translation techniques
- Authors: Tu Anh Dinh, Danni Liu, Jan Niehues
- Abstract summary: Several techniques have been proposed for zero-shot translation.
We investigate whether these ideas can be applied to speech translation, by building ST models trained on speech transcription and text translation data.
The techniques were successfully applied to few-shot ST using limited ST data, with improvements of up to +12.9 BLEU points compared to direct end-to-end ST and +3.1 BLEU points compared to ST models fine-tuned from an ASR model.
- Score: 12.968557512440759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, end-to-end speech translation (ST) has gained significant attention
as it avoids error propagation. However, the approach suffers from data
scarcity. It heavily depends on direct ST data and is less efficient in making
use of speech transcription and text translation data, which is often more
easily available. In the related field of multilingual text translation,
several techniques have been proposed for zero-shot translation. A main idea is
to increase the similarity of semantically similar sentences in different
languages. We investigate whether these ideas can be applied to speech
translation, by building ST models trained on speech transcription and text
translation data. We study the effects of data augmentation and an auxiliary
loss function. The techniques were successfully applied to few-shot ST using
limited ST data, with improvements of up to +12.9 BLEU points compared to
direct end-to-end ST and +3.1 BLEU points compared to ST models fine-tuned from
an ASR model.
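As a concrete illustration of the auxiliary-loss idea (increasing the similarity of representations of semantically equivalent inputs), here is a minimal, hypothetical PyTorch sketch: it pulls the mean-pooled speech-encoder and text-encoder states of the same utterance together and would be added to the usual translation cross-entropy. The function and variable names are illustrative, not the paper's implementation.

```python
# Minimal sketch (assumption: PyTorch encoders returning (batch, time, dim) states
# with boolean padding masks); not the paper's exact auxiliary loss.
import torch
import torch.nn.functional as F

def masked_mean(states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool encoder states over non-padded time steps."""
    mask = mask.unsqueeze(-1).float()                      # (batch, time, 1)
    return (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

def similarity_aux_loss(speech_states, speech_mask, text_states, text_mask):
    """Distance between pooled speech and transcript representations of the same utterance."""
    speech_vec = masked_mean(speech_states, speech_mask)
    text_vec = masked_mean(text_states, text_mask)
    return F.mse_loss(speech_vec, text_vec)

# Training objective (sketch): total = ce_translation_loss + aux_weight * similarity_aux_loss(...)
```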
Related papers
- Pushing the Limits of Zero-shot End-to-End Speech Translation [15.725310520335785]
Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems.
We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data.
Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
arXiv Detail & Related papers (2024-02-16T03:06:37Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
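The Optimal Transport alignment named in the entry above can be illustrated with a generic entropic-OT (Sinkhorn) loss between two sets of latent vectors. This is a rough sketch under that assumption, not the paper's formulation for semantic-parsing latent variables.

```python
# Generic Sinkhorn sketch: entropic OT cost between two sets of latent vectors
# (e.g., encodings of parallel sentences in two languages). Illustrative only.
import torch

def sinkhorn_ot_loss(x: torch.Tensor, y: torch.Tensor,
                     epsilon: float = 0.1, n_iters: int = 50) -> torch.Tensor:
    """x: (n, d), y: (m, d) latent vectors; returns the entropic OT cost."""
    cost = torch.cdist(x, y, p=2) ** 2                 # (n, m) pairwise squared distances
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)                     # uniform marginals
    nu = torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / epsilon)                     # Gibbs kernel
    u = torch.ones_like(mu)
    v = torch.ones_like(nu)
    for _ in range(n_iters):                           # Sinkhorn fixed-point updates
        u = mu / (K @ v + 1e-9)
        v = nu / (K.T @ u + 1e-9)
    transport_plan = u.unsqueeze(1) * K * v.unsqueeze(0)
    return (transport_plan * cost).sum()               # expected transport cost
```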
- Back Translation for Speech-to-text Translation Without Transcripts [11.13240570688547]
We develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data.
To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units.
With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets.
arXiv Detail & Related papers (2023-05-15T15:12:40Z)
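A very rough sketch of the back-translation pipeline described in the entry above: monolingual target text is mapped by a reverse model to discrete speech units and then to audio, yielding pseudo ST pairs. Both model callables (`text_to_units`, `units_to_audio`) are placeholders assumed for illustration, not components of BT4ST.

```python
# Hypothetical back-translation-for-ST sketch: synthesize pseudo speech inputs
# for monolingual target-language text and pair them to form extra ST data.
from typing import Callable, List, Tuple

def synthesize_pseudo_st(
    target_sentences: List[str],
    text_to_units: Callable[[str], List[int]],    # assumed reverse model: target text -> discrete speech units
    units_to_audio: Callable[[List[int]], bytes], # assumed unit-based vocoder: units -> waveform
) -> List[Tuple[bytes, str]]:
    """Return (pseudo_speech, target_text) pairs usable as additional ST training data."""
    pseudo_data = []
    for text in target_sentences:
        units = text_to_units(text)        # discrete units ease the one-to-many text-to-speech mapping
        audio = units_to_audio(units)      # synthesize source-side speech for this target sentence
        pseudo_data.append((audio, text))
    return pseudo_data
```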
- Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation [70.33052952571884]
We propose to build a cascaded speech translation system without leveraging any kind of paired data.
We use fully unpaired data to train our unsupervised systems and evaluate our results on CoVoST 2 and CVSS.
arXiv Detail & Related papers (2023-05-12T13:07:51Z)
- AdaTranS: Adapting with Boundary-based Shrinking for End-to-End Speech Translation [36.12146100483228]
AdaTranS adapts the speech features with a new shrinking mechanism to mitigate the length mismatch between speech and text features.
Experiments on the MuST-C dataset demonstrate that AdaTranS achieves better performance than other shrinking-based methods.
arXiv Detail & Related papers (2022-12-17T16:14:30Z)
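The shrinking mechanism referred to above can be illustrated generically: given predicted segment boundaries (e.g., derived from CTC), consecutive speech frames within a segment are averaged so the speech sequence length moves closer to the text length. A hypothetical PyTorch sketch, not AdaTranS's exact boundary predictor:

```python
# Generic boundary-based shrinking sketch (assumption: boundaries are given,
# e.g., from CTC predictions); not the AdaTranS implementation.
import torch

def shrink_by_boundaries(frames: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """frames: (time, dim) encoder states; boundaries: (time,) bool, True where a segment starts."""
    segment_ids = torch.cumsum(boundaries.long(), dim=0)
    # Re-index to consecutive 0-based segment ids.
    _, segment_ids = torch.unique_consecutive(segment_ids, return_inverse=True)
    num_segments = int(segment_ids.max().item()) + 1
    summed = torch.zeros(num_segments, frames.size(1), dtype=frames.dtype)
    summed.index_add_(0, segment_ids, frames)
    counts = torch.zeros(num_segments, dtype=frames.dtype)
    counts.index_add_(0, segment_ids, torch.ones(frames.size(0), dtype=frames.dtype))
    return summed / counts.unsqueeze(1)                # segment-averaged features

# Example: 6 frames shrunk into 3 segments of lengths 2, 3 and 1.
frames = torch.randn(6, 4)
boundaries = torch.tensor([True, False, True, False, False, True])
print(shrink_by_boundaries(frames, boundaries).shape)  # torch.Size([3, 4])
```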
- Large-Scale Streaming End-to-End Speech Translation with Neural Transducers [35.2855796745394]
We introduce a streaming end-to-end speech translation (ST) model to convert audio signals to texts in other languages directly.
Compared with cascaded ST that performs ASR followed by text-based machine translation (MT), the proposed Transformer transducer (TT)-based ST model drastically reduces inference latency.
We extend TT-based ST to multilingual ST, which generates texts of multiple languages at the same time.
arXiv Detail & Related papers (2022-04-11T18:18:53Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Zero-shot Speech Translation [0.0]
Speech Translation (ST) is the task of translating speech in one language into text in another language.
End-to-end approaches use only one system, which avoids error propagation, yet they are difficult to employ due to data scarcity.
We explore zero-shot translation, which enables translating a pair of languages that is unseen during training.
arXiv Detail & Related papers (2021-07-13T12:00:44Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
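The consecutive-decoding idea in the last entry (COSTT) can be illustrated by how the decoder target is constructed: a single sequence that first contains the source transcript and then the translation, separated by a tag. A hypothetical sketch; the separator token is assumed, not taken from the paper.

```python
# Hypothetical sketch of a consecutive decoder target: transcript, then translation,
# separated by an assumed tag token.
from typing import List

SEP_TOKEN = "<2trans>"   # assumed separator/language tag, for illustration only

def build_consecutive_target(transcript_tokens: List[str],
                             translation_tokens: List[str]) -> List[str]:
    """Concatenate transcript and translation into one decoder target sequence."""
    return transcript_tokens + [SEP_TOKEN] + translation_tokens

# Example: the single decoder is trained to first emit the transcript, then the translation.
print(build_consecutive_target(["how", "are", "you"], ["wie", "geht", "es", "dir"]))
```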
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.