WACO: Word-Aligned Contrastive Learning for Speech Translation
- URL: http://arxiv.org/abs/2212.09359v3
- Date: Fri, 7 Jul 2023 04:56:14 GMT
- Title: WACO: Word-Aligned Contrastive Learning for Speech Translation
- Authors: Siqi Ouyang, Rong Ye, Lei Li
- Abstract summary: End-to-end Speech Translation (E2E ST) aims to directly translate source speech into target text.
Existing ST methods perform poorly when only an extremely small amount of speech-text data is available for training.
We propose Word-Aligned COntrastive learning (WACO), a simple and effective method for extremely low-resource speech-to-text translation.
- Score: 11.67083845641806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end Speech Translation (E2E ST) aims to directly translate source
speech into target text. Existing ST methods perform poorly when only an
extremely small amount of speech-text data is available for training. We observe that an ST
model's performance closely correlates with its embedding similarity between
speech and source transcript. In this paper, we propose Word-Aligned
COntrastive learning (WACO), a simple and effective method for extremely
low-resource speech-to-text translation. Our key idea is bridging word-level
representations for both speech and text modalities via contrastive learning.
We evaluate WACO and other methods on the MuST-C dataset, a widely used ST
benchmark, and on the low-resource Maltese-English direction from IWSLT 2023. Our
experiments demonstrate that WACO outperforms the best baseline by more than 9 BLEU
points with only 1 hour of parallel ST data. Code is available at
https://github.com/owaski/WACO.
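The core idea lends itself to a compact sketch: pool each word's speech frames and each word's text tokens into single vectors, then pull matched speech-text word pairs together with an InfoNCE contrastive loss. Below is a minimal PyTorch sketch, assuming word boundaries come from a forced aligner for speech and a subword-to-word map for text; the function names and pooling details are illustrative, not the released implementation (see the repository above for that).

```python
# A minimal sketch of word-aligned contrastive learning, assuming word
# spans are available externally. Illustrative, not WACO's released code.
import torch
import torch.nn.functional as F

def pool_words(frames: torch.Tensor, spans: list) -> torch.Tensor:
    """Mean-pool (T, D) encoder states over each word's [start, end) span."""
    return torch.stack([frames[s:e].mean(dim=0) for s, e in spans])

def word_contrastive_loss(speech_words: torch.Tensor,
                          text_words: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE: the i-th speech word should be nearest to the i-th text word."""
    s = F.normalize(speech_words, dim=-1)   # (N, D) unit vectors
    t = F.normalize(text_words, dim=-1)     # (N, D) unit vectors
    logits = s @ t.T / temperature          # (N, N) scaled cosine similarities
    targets = torch.arange(s.size(0))       # matched pairs sit on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 4 aligned words, 256-dim embeddings.
speech = pool_words(torch.randn(80, 256), [(0, 20), (20, 45), (45, 60), (60, 80)])
text = pool_words(torch.randn(10, 256), [(0, 2), (2, 5), (5, 7), (7, 10)])
print(word_contrastive_loss(speech, text))
```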
Related papers
- CMU's IWSLT 2024 Simultaneous Speech Translation System [80.15755988907506]
This paper describes CMU's submission to the IWSLT 2024 Simultaneous Speech Translation (SST) task for translating English speech to German text in a streaming manner.
Our end-to-end speech-to-text (ST) system integrates the WavLM speech encoder, a modality adapter, and the Llama2-7B-Base model as the decoder.
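The described wiring is straightforward to sketch. Below is a hedged sketch with dummy linear modules standing in for WavLM and Llama2-7B; the abstract does not specify the adapter's design, so a single linear projection is assumed here.

```python
# Encoder -> modality adapter -> LLM decoder, with dummy stand-ins for
# WavLM and Llama2-7B. The linear adapter is an assumption of this sketch.
import torch
import torch.nn as nn

class AdapterST(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module, enc_dim: int, dec_dim: int):
        super().__init__()
        self.encoder = encoder                      # speech encoder (WavLM in the paper)
        self.adapter = nn.Linear(enc_dim, dec_dim)  # maps speech states into the LLM's space
        self.decoder = decoder                      # LLM decoder (Llama2-7B in the paper)

    def forward(self, speech: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.adapter(self.encoder(speech)))

# Dummy modules just to show the data flow end to end.
model = AdapterST(nn.Linear(80, 1024), nn.Linear(4096, 32000), enc_dim=1024, dec_dim=4096)
print(model(torch.randn(1, 100, 80)).shape)  # torch.Size([1, 100, 32000])
```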
arXiv Detail & Related papers (2024-08-14T10:44:51Z)
- DUB: Discrete Unit Back-translation for Speech Translation [32.74997208667928]
We propose Discrete Unit Back-translation (DUB) to answer the question: is it better to represent speech with discrete units than with continuous features in direct ST?
With DUB, the back-translation technique can be successfully applied to direct ST, yielding an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es.
In the low-resource language scenario, our method achieves comparable performance to existing methods that rely on large-scale external data.
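The step that makes back-translation possible is discretization: once speech is a sequence of unit IDs, it can be treated like text. A sketch of nearest-centroid quantization follows, with assumed dimensions (HuBERT-style features, 500 units).

```python
# Map continuous speech features to discrete unit IDs via a k-means
# codebook; dimensions and cluster count here are assumptions.
import torch

def quantize_to_units(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """features: (T, D) encoder states; codebook: (K, D) centroids -> (T,) unit IDs."""
    return torch.cdist(features, codebook).argmin(dim=-1)

feats = torch.randn(50, 768)       # 50 frames of 768-dim features
centroids = torch.randn(500, 768)  # 500 units, a common HuBERT-style setup
units = quantize_to_units(feats, centroids)
print(units[:10])  # unit IDs now play the role of source tokens for back-translation
```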
arXiv Detail & Related papers (2023-05-19T03:48:16Z)
- BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric [66.73705349465207]
End-to-end speech-to-speech translation (S2ST) is generally evaluated with text-based metrics, which requires first transcribing the output speech with ASR.
We propose a text-free evaluation metric for end-to-end S2ST, named BLASER, to avoid this dependency on ASR systems.
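The text-free idea reduces to comparing embeddings in a shared speech space. Below is a sketch of an unsupervised-style score, assuming a multilingual speech encoder has already produced the embeddings; the exact combination BLASER uses (and its trained, supervised variant) is not shown here.

```python
# Score a translation by its embedding similarity to source and reference
# speech; the simple averaging below is an assumption of this sketch.
import torch
import torch.nn.functional as F

def text_free_score(src_emb, hyp_emb, ref_emb):
    sim_src = F.cosine_similarity(hyp_emb, src_emb, dim=-1)
    sim_ref = F.cosine_similarity(hyp_emb, ref_emb, dim=-1)
    return (sim_src + sim_ref) / 2

src, hyp, ref = (torch.randn(1, 1024) for _ in range(3))  # assumed encoder outputs
print(text_free_score(src, hyp, ref))
```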
arXiv Detail & Related papers (2022-12-16T14:00:26Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims to translate source-language speech into target-language text without generating intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
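A sketch of the shared-vocabulary idea: text states select codes from one codebook, and the speech side is trained to predict the same code IDs, assuming the speech states have already been length-matched to the text sequence. The straight-through estimation and commitment losses of real VQ training are omitted.

```python
# Discrete cross-modal alignment sketch: one shared codebook, text picks
# the codes, speech learns to predict them. Sizes are illustrative.
import torch
import torch.nn.functional as F

codebook = torch.randn(1024, 512)                 # shared discrete vocabulary

def nearest_codes(states: torch.Tensor) -> torch.Tensor:
    return torch.cdist(states, codebook).argmin(dim=-1)  # (T,) code IDs

text_states = torch.randn(20, 512)                # text encoder output
target_codes = nearest_codes(text_states)         # "gold" discrete sequence

speech_logits = torch.randn(20, 1024, requires_grad=True)  # speech-side code predictions
loss = F.cross_entropy(speech_logits, target_codes)        # pull speech onto text codes
loss.backward()
print(loss.item())
```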
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Cross-modal Contrastive Learning for Speech Translation [36.63604508886932]
ConST is a cross-modal contrastive learning method for end-to-end speech-to-text translation.
Experiments show that the proposed ConST consistently outperforms previous methods on the MuST-C benchmark.
Its learned representation improves the accuracy of cross-modal speech-text retrieval from 4% to 88%.
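For contrast with the word-level sketch under the WACO abstract above: a sentence-level variant in the spirit of ConST pools the whole utterance into one vector before the same InfoNCE loss; mean pooling here is an assumption of the sketch.

```python
# Sentence-level pooling: one vector per utterance / per sentence, then
# the same InfoNCE as in the word-level sketch above.
import torch

speech_frames = torch.randn(100, 256)   # one utterance's encoder states
text_tokens = torch.randn(12, 256)      # its transcript's encoder states
speech_vec = speech_frames.mean(dim=0)  # utterance-level representation
text_vec = text_tokens.mean(dim=0)      # sentence-level representation
# Batches of (speech_vec, text_vec) pairs feed the same contrastive loss,
# just with utterances rather than words as the contrasted units.
```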
arXiv Detail & Related papers (2022-05-05T05:14:01Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that translates speech in one language directly into speech in another.
We tackle the challenge of modeling multi-speaker target speech and train the system on real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
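The single-decoder idea is easy to picture as a training target: transcript first, then translation, in one sequence. A sketch with a hypothetical <sep> token:

```python
# Consecutive decoding sketch: one decoder target holding transcript then
# translation; the <sep> token is a hypothetical separator.
transcript = ["how", "are", "you"]
translation = ["wie", "geht", "es", "dir"]

target = transcript + ["<sep>"] + translation   # single decoder target
print(target)  # ['how', 'are', 'you', '<sep>', 'wie', 'geht', 'es', 'dir']
# At inference, tokens after <sep> are returned as the translation.
```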
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel data.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)