Cross-modal Contrastive Learning for Speech Translation
- URL: http://arxiv.org/abs/2205.02444v1
- Date: Thu, 5 May 2022 05:14:01 GMT
- Title: Cross-modal Contrastive Learning for Speech Translation
- Authors: Rong Ye, Mingxuan Wang, Lei Li
- Abstract summary: ConST is a cross-modal contrastive learning method for end-to-end speech-to-text translation.
Experiments show that the proposed ConST consistently outperforms previous methods on the MuST-C benchmark.
Its learned representation improves the accuracy of cross-modal speech-text retrieval from 4% to 88%.
- Score: 36.63604508886932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How can we learn unified representations for spoken utterances and their
written text? Learning similar representations for semantically similar speech
and text is important for speech translation. To this end, we propose ConST, a
cross-modal contrastive learning method for end-to-end speech-to-text
translation. We evaluate ConST and a variety of previous baselines on a popular
benchmark MuST-C. Experiments show that the proposed ConST consistently
outperforms the previous methods and achieves an average BLEU of 29.4. The
analysis further verifies that ConST indeed closes the representation gap of
different modalities -- its learned representation improves the accuracy of
cross-modal speech-text retrieval from 4% to 88%. Code and models are available
at https://github.com/ReneeYe/ConST.
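As a rough illustration of the idea (not the authors' released implementation), the PyTorch sketch below computes an InfoNCE-style contrastive loss between pooled speech and text embeddings; the batch size, embedding dimension, pooling, and temperature are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(speech_emb, text_emb, temperature=0.05):
    """InfoNCE-style loss that pulls each utterance embedding toward its paired
    transcript embedding and pushes it away from the other texts in the batch.
    speech_emb, text_emb: (batch, dim) pooled sentence-level representations."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(speech_emb.size(0))        # matching pairs sit on the diagonal
    # Symmetric objective: speech-to-text and text-to-speech directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with 4 random utterance/transcript pairs and 256-dim embeddings.
speech = torch.randn(4, 256)
text = torch.randn(4, 256)
print(cross_modal_contrastive_loss(speech, text).item())
```
Because the same similarity matrix supports nearest-neighbour lookup across modalities, an objective of this form is also consistent with the cross-modal retrieval improvement quoted in the abstract.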
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT-4o score and human evaluation.
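As a minimal sketch of how such preference data could be assembled (the `generate_continuations` and `semantic_score` callables are hypothetical placeholders, not the paper's interfaces):
```python
# Hypothetical sketch: sample several speech continuations per prompt, rank them
# with a semantic metric, and keep the best/worst as a DPO preference pair.
def build_preference_pairs(prompts, generate_continuations, semantic_score, n_samples=4):
    pairs = []
    for prompt in prompts:
        candidates = generate_continuations(prompt, n=n_samples)  # N sampled continuations
        ranked = sorted(candidates, key=semantic_score)           # low -> high semantic score
        pairs.append({
            "prompt": prompt,
            "chosen": ranked[-1],    # highest-scoring continuation
            "rejected": ranked[0],   # lowest-scoring continuation
        })
    return pairs
```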
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought [33.32415197728357]
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks.
We introduce a three-stage training framework designed to activate the chain-of-thought capabilities of SLMs.
We propose CoT-ST, a speech translation model that utilizes multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation.
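Conceptually, the decomposition resembles the pipeline sketch below; `asr_model` and `mt_model` are placeholder callables standing in for the recognition and translation steps, not the paper's actual components.
```python
# Illustrative two-step chain: transcribe the source speech, then translate the
# transcript. Placeholder callables are used instead of the paper's SLM.
def cot_speech_translation(audio, asr_model, mt_model):
    transcript = asr_model(audio)       # step 1: speech recognition
    translation = mt_model(transcript)  # step 2: text translation
    return {"transcript": transcript, "translation": translation}
```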
arXiv Detail & Related papers (2024-09-29T01:48:09Z)
- FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast large language model based method for streaming speech translation.
Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
- Rethinking and Improving Multi-task Learning for End-to-end Speech Translation [51.713683037303035]
We investigate the consistency between different tasks, considering different times and modules.
We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations.
We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.
arXiv Detail & Related papers (2023-11-07T08:48:46Z)
- WACO: Word-Aligned Contrastive Learning for Speech Translation [11.67083845641806]
End-to-end Speech Translation (E2E ST) aims to directly translate source speech into target text.
Existing ST methods perform poorly when only an extremely small amount of parallel speech-text data is available for training.
We propose Word-Aligned COntrastive learning (WACO), a simple and effective method for extremely low-resource speech-to-text translation.
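A minimal sketch of a word-aligned contrastive objective under simplifying assumptions (word-span boundaries are given, speech frames are mean-pooled per word, and the temperature is arbitrary), rather than WACO's exact formulation:
```python
import torch
import torch.nn.functional as F

def word_aligned_contrastive_loss(frame_emb, word_emb, spans, temperature=0.1):
    """frame_emb: (n_frames, dim) speech encoder outputs; word_emb: (n_words, dim)
    text embeddings; spans: list of (start, end) frame indices, one per word."""
    pooled = torch.stack([frame_emb[s:e].mean(dim=0) for s, e in spans])  # per-word speech vectors
    pooled = F.normalize(pooled, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    logits = pooled @ word_emb.t() / temperature  # (n_words, n_words) similarities
    targets = torch.arange(len(spans))            # each pooled span matches its own word
    return F.cross_entropy(logits, targets)
```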
arXiv Detail & Related papers (2022-12-19T10:49:35Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
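A rough sketch of a text-free score computed from speech embedding similarities, in the spirit of the description above; the equal weighting of source and reference similarity is an assumption, not BLASER's trained scoring model.
```python
import torch
import torch.nn.functional as F

def text_free_score(src_emb, hyp_emb, ref_emb):
    """Score a hypothesis speech segment from embeddings of the source speech,
    the system's output speech, and a reference speech segment."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    # Assumed equal weighting of hypothesis-source and hypothesis-reference similarity.
    return 0.5 * (sim(hyp_emb, src_emb) + sim(hyp_emb, ref_emb))

# Toy usage with 512-dim speech embeddings.
print(text_free_score(torch.randn(512), torch.randn(512), torch.randn(512)).item())
```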
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate the discrepancy between speech and text representations.
Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
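An illustrative sketch of mixing aligned speech and text embedding sequences, assuming a simple Bernoulli mixing probability rather than STEMM's actual self-learning mixup schedule:
```python
import torch

def speech_text_mixup(speech_seq, text_seq, p=0.5):
    """speech_seq, text_seq: (length, dim) embedding sequences aligned position by
    position; each position keeps its speech embedding with probability p,
    otherwise the text embedding is substituted."""
    keep_speech = torch.rand(speech_seq.size(0), 1) < p
    return torch.where(keep_speech, speech_seq, text_seq)

# Toy usage: mix two aligned 10-step, 256-dim sequences.
mixed = speech_text_mixup(torch.randn(10, 256), torch.randn(10, 256))
print(mixed.shape)
```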
arXiv Detail & Related papers (2022-03-20T01:49:53Z)
- Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech [8.465993273653554]
We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition.
We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance.
We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
arXiv Detail & Related papers (2020-11-24T00:48:54Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
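As a minimal sketch of how a consecutive-decoding training target could be laid out, with a hypothetical `<sep>` token separating the transcript from the translation:
```python
# One decoder target containing the source transcript, a separator, and the
# translation, so a single pass emits both. The <sep> token is an assumption.
def build_consecutive_target(transcript_tokens, translation_tokens, sep="<sep>"):
    return transcript_tokens + [sep] + translation_tokens

print(build_consecutive_target(["how", "are", "you"], ["wie", "geht", "es", "dir"]))
```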
arXiv Detail & Related papers (2020-09-21T10:10:45Z)