Related papers: Cross-modal Contrastive Learning for Speech Translation

Cross-modal Contrastive Learning for Speech Translation

URL: http://arxiv.org/abs/2205.02444v1
Date: Thu, 5 May 2022 05:14:01 GMT
Title: Cross-modal Contrastive Learning for Speech Translation
Authors: Rong Ye, Mingxuan Wang, Lei Li
Abstract summary: ConST is a cross-modal contrastive learning method for end-to-end speech-to-text translation. Experiments show that the proposed ConST consistently outperforms the previous methods on. Its learned representation improves the accuracy of cross-modal speech-text retrieval from 4% to 88%.
Score: 36.63604508886932
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: How can we learn unified representations for spoken utterances and their written text? Learning similar representations for semantically similar speech and text is important for speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on a popular benchmark MuST-C. Experiments show that the proposed ConST consistently outperforms the previous methods on, and achieves an average BLEU of 29.4. The analysis further verifies that ConST indeed closes the representation gap of different modalities -- its learned representation improves the accuracy of cross-modal speech-text retrieval from 4% to 88%. Code and models are available at https://github.com/ReneeYe/ConST.

Related papers

OZSpeech: One-step Zero-shot Speech Synthesis with Learned-Prior-Conditioned Flow Matching [3.05024318465243]
OZSpeech is the first TTS method to explore optimal transport conditional flow matching with one-step sampling.<n>Our approach operates on disentangled, factorized components of speech in token format, enabling accurate modeling of each speech attribute.<n> Experimental results show that our method achieves promising performance over existing methods in content accuracy, naturalness, prosody generation, and speaker style preservation.
arXiv Detail & Related papers (2025-05-19T07:31:55Z)
Representation Purification for End-to-End Speech Translation [16.967317436711113]
Speech-to-text translation (ST) is a cross-modal task that involves converting spoken language into text in a different language. We conceptualize speech representation as a combination of content-agnostic and content-relevant factors.
arXiv Detail & Related papers (2024-12-05T15:50:44Z)
Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs) Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO) We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought [33.32415197728357]
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. We introduce a three-stage training framework designed to activate the chain-of-thought capabilities of SLMs. We propose CoT-ST, a speech translation model that utilizes multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation.
arXiv Detail & Related papers (2024-09-29T01:48:09Z)
FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly. We propose FASST, a fast large language model based method for streaming speech translation. Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z)
Rethinking and Improving Multi-task Learning for End-to-end Speech Translation [51.713683037303035]
We investigate the consistency between different tasks, considering different times and modules. We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations. We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.
arXiv Detail & Related papers (2023-11-07T08:48:46Z)
WACO: Word-Aligned Contrastive Learning for Speech Translation [11.67083845641806]
Speech Translation (E2E) aims to directly translate source speech into target text. Existing ST methods perform poorly when only extremely small speech-text data are available for training. We propose Word-Aligned COntrastive learning (WACO), a simple and effective method for extremely low-resource speech-to-text translation.
arXiv Detail & Related papers (2022-12-19T10:49:35Z)
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics. We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid the dependency on ASR systems.
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy. Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
arXiv Detail & Related papers (2022-03-20T01:49:53Z)
Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech [8.465993273653554]
We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition. We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance. We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
arXiv Detail & Related papers (2020-11-24T00:48:54Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation. The key idea is to generate source transcript and target translation text with a single decoder. Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.