Pushing the Limits of Zero-shot End-to-End Speech Translation
- URL: http://arxiv.org/abs/2402.10422v2
- Date: Wed, 5 Jun 2024 20:56:20 GMT
- Title: Pushing the Limits of Zero-shot End-to-End Speech Translation
- Authors: Ioannis Tsiamas, Gerard I. Gállego, José A. R. Fonollosa, Marta R. Costa-jussà,
- Abstract summary: Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems.
We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data.
Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
- Score: 15.725310520335785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems, thus hindering their performance. Prior work has attempted to mitigate these challenges by leveraging external MT data and optimizing distance metrics that bring closer the speech-text representations. However, achieving competitive results typically requires some ST data. For this reason, we introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data. Leveraging a novel CTC compression and Optimal Transport, we train a speech encoder using only ASR data, to align with the representation space of a massively multilingual MT model. The speech encoder seamlessly integrates with the MT model at inference, enabling direct translation from speech to text, across all languages supported by the MT model. Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority over not only previous zero-shot models, but also supervised ones, achieving state-of-the-art results.
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard- Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z) - Towards Zero-Shot Multimodal Machine Translation [64.9141931372384]
We propose a method to bypass the need for fully supervised data to train multimodal machine translation systems.
Our method, called ZeroMMT, consists in adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives.
To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese.
arXiv Detail & Related papers (2024-07-18T15:20:31Z) - Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard
Parameter Sharing [72.56219471145232]
We propose a ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z) - DUB: Discrete Unit Back-translation for Speech Translation [32.74997208667928]
We propose Discrete Unit Back-translation (DUB) to answer two questions: Is it better to represent speech with discrete units than with continuous features in direct ST?
With DUB, the back-translation technique can successfully be applied on direct ST and obtains an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es.
In the low-resource language scenario, our method achieves comparable performance to existing methods that rely on large-scale external data.
arXiv Detail & Related papers (2023-05-19T03:48:16Z) - Back Translation for Speech-to-text Translation Without Transcripts [11.13240570688547]
We develop a back translation algorithm for ST (BT4ST) to synthesize pseudo ST data from monolingual target data.
To ease the challenges posed by short-to-long generation and one-to-many mapping, we introduce self-supervised discrete units.
With our synthetic ST data, we achieve an average boost of 2.3 BLEU on MuST-C En-De, En-Fr, and En-Es datasets.
arXiv Detail & Related papers (2023-05-15T15:12:40Z) - AdaTranS: Adapting with Boundary-based Shrinking for End-to-End Speech
Translation [36.12146100483228]
AdaTranS adapts the speech features with a new shrinking mechanism to mitigate the length mismatch between speech and text features.
Experiments on the MUST-C dataset demonstrate that AdaTranS achieves better performance than the other shrinking-based methods.
arXiv Detail & Related papers (2022-12-17T16:14:30Z) - Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z) - Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - Zero-shot Speech Translation [0.0]
Speech Translation (ST) is the task of translating speech in one language into text in another language.
End-to-end approaches use only one system to avoid propagating error, yet are difficult to employ due to data scarcity.
We explore zero-shot translation, which enables translating a pair of languages that is unseen during training.
arXiv Detail & Related papers (2021-07-13T12:00:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.