T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine
Translation
- URL: http://arxiv.org/abs/2205.12216v1
- Date: Tue, 24 May 2022 17:23:35 GMT
- Title: T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine
Translation
- Authors: Paul-Ambroise Duquenne, Hongyu Gong, Benoît Sagot, Holger Schwenk
- Abstract summary: We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks.
Multilingual speech and text are encoded in a joint fixed-size representation space.
We compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities.
- Score: 19.332953510406327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new approach to perform zero-shot cross-modal transfer between
speech and text for translation tasks. Multilingual speech and text are encoded
in a joint fixed-size representation space. Then, we compare different
approaches to decode these multimodal and multilingual fixed-size
representations, enabling zero-shot translation between languages and
modalities. All our models are trained without the need for cross-modal labeled
translation data. Despite a fixed-size representation, we achieve very
competitive results on several text and speech translation tasks. In
particular, we significantly improve the state-of-the-art for zero-shot speech
translation on MuST-C. Incorporating a speech decoder in our framework, we
introduce the first results for zero-shot direct speech-to-speech and
text-to-speech translation.
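As a rough sketch of this modular design (not the authors' released code; the class names, dimensions, and the mean-pooling choice below are assumptions made for illustration), the key property is that every encoder, whatever the language or modality, emits one fixed-size vector, and every decoder consumes only that vector, so independently trained modules can be recombined at test time for zero-shot cross-modal translation:

    import torch
    import torch.nn as nn

    class FixedSizeEncoder(nn.Module):
        """Encode a variable-length input sequence into one fixed-size vector."""
        def __init__(self, input_dim, embed_dim=1024):
            super().__init__()
            self.proj = nn.Linear(input_dim, embed_dim)
            layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, feats):                      # feats: (batch, time, input_dim)
            hidden = self.encoder(self.proj(feats))    # (batch, time, embed_dim)
            return hidden.mean(dim=1)                  # pool to a fixed-size sentence vector

    class SentenceEmbeddingDecoder(nn.Module):
        """Autoregressive decoder conditioned only on the fixed-size sentence embedding."""
        def __init__(self, vocab_size, embed_dim=1024):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, embed_dim)
            layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=2)
            self.out = nn.Linear(embed_dim, vocab_size)

        def forward(self, sent_emb, prev_tokens):      # sent_emb: (batch, embed_dim)
            memory = sent_emb.unsqueeze(1)             # treat the vector as a one-step "memory"
            hidden = self.decoder(self.tok(prev_tokens), memory)
            return self.out(hidden)                    # (batch, tgt_len, vocab_size)

    # Because every encoder (speech or text, any language) maps into the same
    # fixed-size space, a decoder trained only on text can be paired with a
    # speech encoder at test time: zero-shot cross-modal translation.
    speech_encoder = FixedSizeEncoder(input_dim=80)    # e.g. 80-dim filterbank frames
    text_decoder = SentenceEmbeddingDecoder(vocab_size=32000)
    speech = torch.randn(2, 120, 80)
    prev = torch.zeros(2, 5, dtype=torch.long)
    logits = text_decoder(speech_encoder(speech), prev)
    print(logits.shape)                                # torch.Size([2, 5, 32000])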
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer [22.744544108288274]
In this work, we show that this type of approach can be further improved with multilingual training.
We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
arXiv Detail & Related papers (2023-10-05T17:44:37Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
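A minimal sketch of the "speech units as pseudo-text" step (the feature dimensionality, cluster count, and helper name are assumptions, not the paper's pipeline): frame-level speech features are clustered into a small discrete vocabulary and consecutive repeats are collapsed, yielding token sequences that a standard translation model can consume.

    import numpy as np
    from sklearn.cluster import KMeans

    def speech_to_units(frame_features, kmeans):
        """Map continuous frame features to a deduplicated sequence of discrete units."""
        unit_ids = kmeans.predict(frame_features)          # one cluster id per frame
        deduped = [int(unit_ids[0])]
        for u in unit_ids[1:]:
            if int(u) != deduped[-1]:                       # collapse runs of identical units
                deduped.append(int(u))
        return deduped

    rng = np.random.default_rng(0)
    features = rng.normal(size=(2000, 768))                 # stand-in for pretrained speech features
    kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)

    utterance = rng.normal(size=(50, 768))                  # 50 frames of one utterance
    print(speech_to_units(utterance, kmeans)[:10])          # a short "pseudo-text" token sequence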
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
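One way to picture the shared discrete vocabulary (a sketch under assumed names and sizes, not the DCMA implementation): both modalities' encoder outputs are snapped to the nearest entry of a single codebook, so speech and text end up as ids drawn from the same space and can be matched there.

    import torch
    import torch.nn as nn

    class SharedCodebookQuantizer(nn.Module):
        """Quantize continuous vectors onto a codebook shared by both modalities."""
        def __init__(self, num_codes=1024, dim=512):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, hidden):                                   # hidden: (batch, time, dim)
            flat = hidden.reshape(-1, hidden.size(-1))               # (batch*time, dim)
            dists = torch.cdist(flat, self.codebook.weight)          # distance to every code
            codes = dists.argmin(dim=-1).view(hidden.shape[:-1])     # (batch, time) discrete ids
            quantized = self.codebook(codes)                         # snap to the shared vocabulary
            # straight-through estimator so encoder gradients flow through quantization
            quantized = hidden + (quantized - hidden).detach()
            return codes, quantized

    quantizer = SharedCodebookQuantizer()
    speech_hidden = torch.randn(2, 30, 512)      # from a speech encoder (assumed)
    text_hidden = torch.randn(2, 12, 512)        # from a text encoder (assumed)
    speech_codes, _ = quantizer(speech_hidden)
    text_codes, _ = quantizer(text_hidden)
    # Both modalities now live in the same discrete vocabulary.
    print(speech_codes.shape, text_codes.shape)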
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
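As a simplified stand-in (the actual M-Adapter is a Transformer-based module, as the abstract above says; the strided convolutions and sizes below are assumptions used only to show the length-shrinking idea), an adapter can sit between the speech encoder and the translation decoder and compress the long acoustic sequence toward text-like length:

    import torch
    import torch.nn as nn

    class ShrinkingAdapter(nn.Module):
        """Shrink a long speech-encoder sequence toward text-like length (illustrative only)."""
        def __init__(self, dim=512, shrink_factor=4):
            super().__init__()
            layers = []
            for _ in range(shrink_factor // 2):          # each stride-2 conv halves the length
                layers += [nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU()]
            self.convs = nn.Sequential(*layers)

        def forward(self, speech_hidden):                # (batch, time, dim)
            x = speech_hidden.transpose(1, 2)            # Conv1d expects (batch, dim, time)
            x = self.convs(x)
            return x.transpose(1, 2)                     # (batch, time // shrink_factor, dim)

    adapter = ShrinkingAdapter()
    speech_hidden = torch.randn(2, 200, 512)             # long acoustic sequence
    print(adapter(speech_hidden).shape)                  # torch.Size([2, 50, 512])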
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z)
- Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation [21.622039537743607]
We propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input.
Experiments on three translation directions show that our proposed speech translation models fine-tuned from FAT-MLM substantially improve translation quality.
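A toy sketch of a fused masked-LM objective (layer sizes, token ids, and the masking rate are assumptions, not the FAT-MLM implementation): paired speech frames and text tokens are projected to a shared width, concatenated into one sequence, and a single encoder is trained to recover the masked text tokens from the fused input.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, DIM, MASK_ID = 1000, 256, 3

    acoustic_proj = nn.Linear(80, DIM)                      # 80-dim filterbank frames -> model width
    token_embed = nn.Embedding(VOCAB, DIM)
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
    lm_head = nn.Linear(DIM, VOCAB)

    speech = torch.randn(2, 60, 80)                         # an utterance (toy values)
    tokens = torch.randint(4, VOCAB, (2, 12))               # its paired text (toy values)
    mask = torch.rand(tokens.shape) < 0.15                  # mask ~15% of text positions
    mask[:, 0] = True                                       # ensure at least one masked position
    masked_tokens = tokens.clone()
    masked_tokens[mask] = MASK_ID

    fused = torch.cat([acoustic_proj(speech), token_embed(masked_tokens)], dim=1)
    hidden = encoder(fused)                                 # one encoder over both modalities
    text_hidden = hidden[:, speech.size(1):]                # positions corresponding to text
    loss = F.cross_entropy(lm_head(text_hidden[mask]), tokens[mask])
    print(float(loss))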
arXiv Detail & Related papers (2021-02-10T22:53:40Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
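A minimal sketch of the consecutive-decoding target layout (the token strings and helper names are invented for illustration; COSTT itself is a full encoder-decoder model): the single decoder's target is the source transcript, a separator, then the target translation, which also makes the two parts easy to split apart at inference.

    SEP, EOS = "<sep>", "<eos>"

    def make_consecutive_target(transcript_tokens, translation_tokens):
        """Single training target: source transcript, separator, target translation."""
        return transcript_tokens + [SEP] + translation_tokens + [EOS]

    def split_consecutive_output(generated_tokens):
        """At inference, recover transcript and translation from one decoded stream."""
        sep_pos = generated_tokens.index(SEP)
        transcript = generated_tokens[:sep_pos]
        translation = [t for t in generated_tokens[sep_pos + 1:] if t != EOS]
        return transcript, translation

    target = make_consecutive_target(["wie", "geht", "es"], ["how", "are", "you"])
    print(target)                            # ['wie', 'geht', 'es', '<sep>', 'how', 'are', 'you', '<eos>']
    print(split_consecutive_output(target))  # (['wie', 'geht', 'es'], ['how', 'are', 'you'])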
arXiv Detail & Related papers (2020-09-21T10:10:45Z)