Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
- URL: http://arxiv.org/abs/2310.03724v1
- Date: Thu, 5 Oct 2023 17:44:37 GMT
- Title: Modular Speech-to-Text Translation for Zero-Shot Cross-Modal Transfer
- Authors: Paul-Ambroise Duquenne, Holger Schwenk, Benoît Sagot
- Abstract summary: In this work, we show that this type of approach can be further improved with multilingual training.
We observe significant improvements in zero-shot cross-modal speech translation, even outperforming a supervised approach based on XLSR for several languages.
- Score: 22.744544108288274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has shown that independently trained encoders and decoders,
combined through a shared fixed-size representation, can achieve competitive
performance in speech-to-text translation. In this work, we show that this type
of approach can be further improved with multilingual training. We observe
significant improvements in zero-shot cross-modal speech translation, even
outperforming a supervised approach based on XLSR for several languages.
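To make the modular setup concrete, here is a minimal PyTorch sketch of independently trained encoders meeting a shared decoder through one fixed-size vector; the dimensions, mean-pooling, and GRU decoder are illustrative assumptions, not the paper's actual architecture.
```python
# Minimal sketch of the modular setup: independently trained encoders map
# speech or text into one shared fixed-size vector, and a single decoder
# is reused across modalities. All sizes here are assumptions.
import torch
import torch.nn as nn

D = 512  # shared fixed-size representation dimension (assumption)

class SpeechEncoder(nn.Module):
    def __init__(self, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(n_mels, D)

    def forward(self, mels):      # (batch, frames, n_mels)
        h = torch.relu(self.proj(mels))
        return h.mean(dim=1)      # pool to a fixed size: (batch, D)

class TextEncoder(nn.Module):
    def __init__(self, vocab=32000):
        super().__init__()
        self.emb = nn.Embedding(vocab, D)

    def forward(self, tokens):    # (batch, seq)
        return self.emb(tokens).mean(dim=1)

class Decoder(nn.Module):
    """Autoregressive decoder conditioned only on the fixed-size vector."""
    def __init__(self, vocab=32000):
        super().__init__()
        self.emb = nn.Embedding(vocab, D)
        self.rnn = nn.GRU(D, D, batch_first=True)
        self.out = nn.Linear(D, vocab)

    def forward(self, z, prev_tokens):
        y, _ = self.rnn(self.emb(prev_tokens), z.unsqueeze(0))
        return self.out(y)        # (batch, seq, vocab)

# Zero-shot cross-modal transfer: a decoder trained against TextEncoder
# outputs can consume SpeechEncoder outputs, since both land in the same
# fixed-size space.
```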
Related papers
- Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
This paper examines how to expand the speech translation capability of speech foundation models with restricted data.
Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model.
Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space.
arXiv Detail & Related papers (2024-07-01T09:51:48Z) - Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations [47.89819316477715]
- Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations [47.89819316477715]
We exploit In-Context Tuning (ICT) for One-Shot Cross-lingual transfer in the classification task by introducing In-Context Cross-lingual Transfer (IC-XLT).
The novel concept involves training a model to learn from context examples and subsequently adapting it during inference to a target language by prepending a One-Shot context demonstration in that language.
Our results show that IC-XLT successfully leverages target-language examples to improve the cross-lingual capabilities of the evaluated mT5 model, outperforming prompt-based models adapted through fine-tuning in the zero- and few-shot scenarios.
arXiv Detail & Related papers (2024-04-03T04:40:57Z) - RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z) - ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual
Multi-Speaker Text-to-Speech [58.93395189153713]
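One plausible reading of a regularized contrastive objective is standard InfoNCE on strongly aligned pairs plus a soft proximity constraint on weakly-aligned ones; the margin-based regularizer below is an assumption, not RC3's exact formulation:
```python
# Sketch: strongly aligned image-text pairs get an InfoNCE loss, while
# weakly aligned pairs only get a soft proximity constraint instead of
# being treated as full positives.
import torch
import torch.nn.functional as F

def contrastive_with_regularizer(img, txt, weak_img, weak_txt,
                                 temperature=0.07, margin=0.3, weight=0.1):
    # InfoNCE over strongly aligned pairs (diagonal = positives)
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(img.size(0))
    nce = F.cross_entropy(logits, targets)

    # Proximity regularizer: weakly aligned pairs should stay within
    # `margin` in cosine distance, but are not pushed to match exactly.
    wi, wt = F.normalize(weak_img, dim=-1), F.normalize(weak_txt, dim=-1)
    cos_dist = 1 - (wi * wt).sum(dim=-1)
    reg = F.relu(cos_dist - margin).mean()
    return nce + weight * reg
```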
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z) - T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine
Translation [19.332953510406327]
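The joint masking step can be sketched as random masking of both spectrogram frames and phoneme tokens; the mask ratios and frame-level (rather than span-level) strategy are assumptions:
```python
# Sketch of joint speech-text masking: mask random spectrogram frames and
# random phonemes, then train the model to reconstruct both from the
# joint context.
import torch

def mask_spectrogram(mels, ratio=0.15):
    """mels: (frames, n_mels). Zero out a random subset of frames."""
    mask = torch.rand(mels.size(0)) < ratio
    out = mels.clone()
    out[mask] = 0.0
    return out, mask

def mask_phonemes(phonemes, mask_id=0, ratio=0.15):
    """phonemes: (seq,) int tensor. Replace a random subset with mask_id."""
    mask = torch.rand(phonemes.size(0)) < ratio
    out = phonemes.clone()
    out[mask] = mask_id
    return out, mask
```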
- T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation [19.332953510406327]
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks.
Multilingual speech and text are encoded in a joint fixed-size representation space.
We compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities.
arXiv Detail & Related papers (2022-05-24T17:23:35Z) - ASR data augmentation in low-resource settings using cross-lingual
multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
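As a rough illustration of comparing decoders over a fixed-size representation, the sketch below contrasts two generic options, cross-attending to the vector as a one-element memory versus using it as a recurrent initial state; neither is claimed to be the paper's design:
```python
# Two generic ways to decode a fixed-size sentence embedding.
import torch
import torch.nn as nn

D, V = 512, 32000
z = torch.randn(2, D)              # fixed-size embeddings from any encoder
tokens = torch.randint(0, V, (2, 7))
emb = nn.Embedding(V, D)

# Option A: treat z as a one-element "encoder memory" for cross-attention.
layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
xattn_dec = nn.TransformerDecoder(layer, num_layers=2)
out_a = xattn_dec(emb(tokens), memory=z.unsqueeze(1))  # (2, 7, D)

# Option B: use z as the initial hidden state of a recurrent decoder.
rnn = nn.GRU(D, D, batch_first=True)
out_b, _ = rnn(emb(tokens), z.unsqueeze(0))            # (2, 7, D)

print(out_a.shape, out_b.shape)
```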
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
- mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language directly into text in another language with a single end-to-end model.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows the model to extend to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)