STEMM: Self-learning with Speech-text Manifold Mixup for Speech
Translation
- URL: http://arxiv.org/abs/2203.10426v1
- Date: Sun, 20 Mar 2022 01:49:53 GMT
- Title: STEMM: Self-learning with Speech-text Manifold Mixup for Speech
Translation
- Authors: Qingkai Fang, Rong Ye, Lei Li, Yang Feng, Mingxuan Wang
- Abstract summary: We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy.
Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
- Score: 37.51435498386953
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How to learn a better speech representation for end-to-end speech-to-text
translation (ST) with limited labeled data? Existing techniques often attempt
to transfer powerful machine translation (MT) capabilities to ST, but neglect
the representation discrepancy across modalities. In this paper, we propose the
Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy.
Specifically, we mix up the representation sequences of different modalities,
and take both unimodal speech sequences and multimodal mixed sequences as input
to the translation model in parallel, and regularize their output predictions
with a self-learning framework. Experiments on the MuST-C speech translation
benchmark and further analysis show that our method effectively alleviates the
cross-modal representation discrepancy, and achieves significant improvements
over a strong baseline on eight translation directions.
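The PyTorch-style sketch below illustrates the core recipe from the abstract: replace a random subset of speech representations with their text counterparts, feed both the speech-only and the mixed sequence through the same translation model, and regularize the speech branch toward the mixed branch. The position-wise alignment, the plain KL regularizer, and all names are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def stemm_mix(speech_seq, text_seq, p=0.5):
    # speech_seq, text_seq: (batch, length, dim). We assume the two sequences
    # are already aligned position-by-position, which glosses over the
    # paper's actual speech-text alignment step.
    keep_text = torch.rand(speech_seq.shape[:2], device=speech_seq.device) < p
    keep_text = keep_text.unsqueeze(-1)                  # (batch, length, 1)
    return torch.where(keep_text, text_seq, speech_seq)  # mixed-modal sequence

def stemm_loss(model, speech_seq, text_seq, target):
    # model: any callable mapping (source representations, target tokens)
    # to logits of shape (batch, target_length, vocab); target: (batch, target_length).
    mixed_seq = stemm_mix(speech_seq, text_seq)
    logits_mix = model(mixed_seq, target)      # multimodal branch
    logits_speech = model(speech_seq, target)  # unimodal branch
    # Supervise the mixed branch with the reference translation ...
    ce = F.cross_entropy(logits_mix.transpose(1, 2), target)
    # ... and pull the speech branch toward the mixed branch's predictions
    # (the "self-learning" regularization, here a simple KL term).
    kl = F.kl_div(F.log_softmax(logits_speech, dim=-1),
                  F.softmax(logits_mix, dim=-1).detach(),
                  reduction="batchmean")
    return ce + kl
```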
Related papers
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing
Different Modalities as Different Languages [96.8603701943286]
Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT consistently outperforms single-model counterparts.
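A minimal sketch of the unified-token interface the summary describes: offset each modality's discrete token ids into disjoint ranges of one shared vocabulary so a single translation model can consume any modality. The vocabulary sizes and names below are invented for illustration, not taken from the paper.

```python
# Hypothetical sizes: a text subword vocabulary, a speech codec codebook,
# and an image VQ codebook.
TEXT_VOCAB, SPEECH_CODES, IMAGE_CODES = 32000, 1024, 8192

def to_shared_ids(token_ids, modality):
    # Map each modality's token ids into a disjoint range of one shared
    # vocabulary, so every modality looks like "another language".
    offset = {
        "text": 0,
        "speech": TEXT_VOCAB,                # e.g. ids from a speech codec
        "image": TEXT_VOCAB + SPEECH_CODES,  # e.g. ids from an image VQ model
    }[modality]
    return [t + offset for t in token_ids]
```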
arXiv Detail & Related papers (2024-02-25T07:46:57Z)
- Soft Alignment of Modality Space for End-to-end Speech Translation [49.29045524083467]
End-to-end Speech Translation aims to convert speech into target text within a unified model.
The inherent differences between speech and text modalities often impede effective cross-modal and cross-lingual transfer.
We introduce Soft Alignment (S-Align), using adversarial training to align the representation spaces of both modalities.
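The summary mentions adversarial training to align the two representation spaces; a common way to realize this is a modality discriminator behind a gradient-reversal layer, sketched below in PyTorch. The discriminator shape and mean-pooling are assumptions, not necessarily the paper's S-Align design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass, negated gradient on the backward pass:
    # a standard trick for adversarial feature alignment.
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -grad_out

# Hypothetical modality discriminator (speech vs. text); sizes invented.
discriminator = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 2))

def alignment_loss(encoder_states, modality_labels):
    # Mean-pool encoder states, then ask the discriminator which modality
    # they came from; the reversed gradient trains the encoder to make the
    # two representation spaces indistinguishable.
    pooled = encoder_states.mean(dim=1)
    logits = discriminator(GradReverse.apply(pooled))
    return F.cross_entropy(logits, modality_labels)
```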
arXiv Detail & Related papers (2023-12-18T06:08:51Z)
- Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose a ST/MT multi-tasking framework with hard parameter sharing.
Our method reduces the speech-text modality gap via a pre-processing stage.
We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
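A minimal sketch of hard parameter sharing as the summary describes it: ST and MT pass through one shared Transformer backbone, with a convolutional front-end standing in for the pre-processing stage that shortens speech toward text-like lengths. All modules and sizes are illustrative assumptions, not the paper's architecture.

```python
import torch.nn as nn

class SharedSTMT(nn.Module):
    # ST and MT share one Transformer backbone (hard parameter sharing);
    # only the input front-ends differ.
    def __init__(self, dim=512, vocab=32000, n_mels=80):
        super().__init__()
        # Downsampling conv stands in for the pre-processing stage that
        # reduces the speech-text length mismatch.
        self.speech_frontend = nn.Conv1d(n_mels, dim, kernel_size=5, stride=4)
        self.text_embed = nn.Embedding(vocab, dim)
        self.backbone = nn.Transformer(d_model=dim, batch_first=True)
        self.proj = nn.Linear(dim, vocab)

    def forward(self, src, tgt_embeddings, modality):
        if modality == "speech":  # ST task: src is (batch, frames, n_mels)
            x = self.speech_frontend(src.transpose(1, 2)).transpose(1, 2)
        else:                     # MT task: src is (batch, tokens) of ids
            x = self.text_embed(src)
        return self.proj(self.backbone(x, tgt_embeddings))
```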
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
- Improving speech translation by fusing speech and text [24.31233927318388]
We harness the complementary strengths of speech and text, which are disparate modalities.
We propose Fuse-Speech-Text (FST), a cross-modal model which supports three distinct input modalities for translation.
arXiv Detail & Related papers (2023-05-23T13:13:48Z)
- Understanding and Bridging the Modality Gap for Speech Translation [11.13240570688547]
Multi-task learning is an effective way to share knowledge between machine translation (MT) and end-to-end speech translation (ST).
However, due to the differences between speech and text, there is always a gap between ST and MT.
In this paper, we first aim to understand this modality gap from the target-side representation differences, and link the modality gap to another well-known problem in neural machine translation: exposure bias.
arXiv Detail & Related papers (2023-05-15T15:09:18Z)
- Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation [72.6667341525552]
We present a new MMT approach based on a strong text-only MT model, which uses neural adapters and a novel guided self-attention mechanism.
We also introduce CoMMuTE, a Contrastive Multimodal Translation Evaluation set of ambiguous sentences and their possible translations.
Our approach obtains competitive results compared to strong text-only models on standard English-to-French, English-to-German and English-to-Czech benchmarks.
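The summary names neural adapters as one ingredient; below is a generic bottleneck adapter of that kind, with invented sizes. It is not the paper's guided self-attention mechanism, only the standard residual module commonly inserted into a frozen text-only MT model to adapt it cheaply.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Residual down-project/up-project module; the frozen MT model's hidden
    # states pass through it so multimodal signals can be injected without
    # retraining the backbone.
    def __init__(self, dim=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, hidden):
        return hidden + self.up(self.act(self.down(hidden)))
```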
arXiv Detail & Related papers (2022-12-20T10:18:18Z)
- M3ST: Mix at Three Levels for Speech Translation [66.71994367650461]
We propose the Mix at Three Levels for Speech Translation (M3ST) method to increase the diversity of the augmented training corpus.
In the first stage of fine-tuning, we mix the training corpus at three levels, including word level, sentence level and frame level, and fine-tune the entire model with mixed data.
Experiments on MuST-C speech translation benchmark and analysis show that M3ST outperforms current strong baselines and achieves state-of-the-art results on eight directions with an average BLEU of 29.9.
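The summary does not specify how mixing works at each level, so the sketch below uses generic stand-ins: interpolation at the frame level, aligned-token swapping at the word level, and example concatenation at the sentence level. All three functions are illustrative assumptions, not M3ST's exact scheme.

```python
import torch

def mix_frame_level(feats_a, feats_b, lam=0.7):
    # Frame level: interpolate two same-shape acoustic feature sequences.
    return lam * feats_a + (1 - lam) * feats_b

def mix_word_level(tokens, aligned_tokens, p=0.15):
    # Word level: randomly swap tokens with aligned counterparts
    # (e.g. from the reference translation) for lexical diversity.
    swap = torch.rand(len(tokens)) < p
    return [b if s else a
            for a, b, s in zip(tokens, aligned_tokens, swap.tolist())]

def mix_sentence_level(example_a, example_b):
    # Sentence level: concatenate two training examples into one sample.
    return example_a + example_b
```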
arXiv Detail & Related papers (2022-12-07T14:22:00Z)
- mSLAM: Massively multilingual joint pre-training for speech and text [43.32334037420761]
mSLAM learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
We find that joint pre-training with text improves quality on speech translation, speech intent classification and speech language-ID.
arXiv Detail & Related papers (2022-02-03T02:26:40Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.