Bridging the Modality Gap for Speech-to-Text Translation
- URL: http://arxiv.org/abs/2010.14920v1
- Date: Wed, 28 Oct 2020 12:33:04 GMT
- Title: Bridging the Modality Gap for Speech-to-Text Translation
- Authors: Yuchen Liu, Junnan Zhu, Jiajun Zhang, and Chengqing Zong
- Abstract summary: End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
- Score: 57.47099674461832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end speech translation aims to translate speech in one language into
text in another language in an end-to-end manner. Most existing methods employ an
encoder-decoder structure with a single encoder to learn acoustic
representation and semantic information simultaneously, which ignores the
differences between the speech and text modalities and overloads the encoder,
making such a model difficult to learn. To address these issues, we
propose a Speech-to-Text Adaptation for Speech Translation (STAST) model which
aims to improve the end-to-end model performance by bridging the modality gap
between speech and text. Specifically, we decouple the speech translation
encoder into three parts and introduce a shrink mechanism to match the length
of speech representation with that of the corresponding text transcription. To
obtain better semantic representation, we completely integrate a text-based
translation model into the STAST so that two tasks can be trained in the same
latent space. Furthermore, we introduce a cross-modal adaptation method to
close the distance between speech and text representations. Experimental results
on English-French and English-German speech translation corpora show that
our model significantly outperforms strong baselines and achieves new
state-of-the-art performance.
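The shrink mechanism is described only at a high level in the abstract. As an illustration, the minimal PyTorch sketch below collapses frame-level encoder outputs using greedy labels from an auxiliary CTC recognition head, dropping blank frames and averaging consecutive frames that share a label. CTC-based collapsing is one common realization of such a shrink mechanism; the paper's exact design may differ, and `ctc_shrink` is a hypothetical helper, not code from the paper.

```python
import torch

def ctc_shrink(frames: torch.Tensor, ctc_logits: torch.Tensor,
               blank_id: int = 0) -> torch.Tensor:
    """Shrink a frame-level speech representation toward transcription length.

    frames:     (T, d) acoustic encoder outputs for one utterance
    ctc_logits: (T, V) frame-level logits from an auxiliary CTC head
    Returns a (T', d) tensor with T' <= T: blank segments are dropped and
    consecutive frames sharing a greedy CTC label are averaged into one vector.
    """
    labels = ctc_logits.argmax(dim=-1)              # (T,) greedy frame labels
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != blank_id:           # keep only non-blank segments
                segments.append(frames[start:t].mean(dim=0))
            start = t
    if not segments:                                # degenerate all-blank case
        return frames.mean(dim=0, keepdim=True)
    return torch.stack(segments)                    # (T', d)

# Example: 50 frames of 8-dim features with a 5-symbol CTC head.
shrunk = ctc_shrink(torch.randn(50, 8), torch.randn(50, 5))
```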
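Similarly, the cross-modal adaptation step can be pictured as a distance penalty that pulls the shrunk speech representation toward the text encoder's representation of the paired transcription. The mean pooling and squared L2 distance below are illustrative assumptions, not the paper's confirmed formulation.

```python
import torch

def cross_modal_adaptation_loss(speech_states: torch.Tensor,
                                text_states: torch.Tensor) -> torch.Tensor:
    """Squared L2 distance between mean-pooled speech and text representations.

    speech_states: (T_s, d) shrunk speech encoder output
    text_states:   (T_t, d) text encoder output for the transcription
    """
    return (speech_states.mean(dim=0) - text_states.mean(dim=0)).pow(2).sum()
```

Added to the translation objectives of both tasks, such a term encourages the two encoders to map paired speech and text into the same latent space.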
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - Understanding Shared Speech-Text Representations [34.45772613231558]
Maestro has developed approaches to train speech models by incorporating text into end-to-end models.
We find that a corpus-specific duration model for speech-text alignment is the most important component for learning a shared speech-text representation.
We find that the shared encoder learns a more compact and overlapping speech-text representation than the uni-modal encoders.
arXiv Detail & Related papers (2023-04-27T20:05:36Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z) - T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation [19.332953510406327]
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks.
Multilingual speech and text are encoded in a joint fixed-size representation space.
We compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities.
arXiv Detail & Related papers (2022-05-24T17:23:35Z) - Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining
and Speech Translation [21.622039537743607]
We propose a Fused Acoustic and Text Masked Language Model (FAT-MLM) which jointly learns a unified representation for both acoustic and text input.
Experiments on three translation directions show that our proposed speech translation models fine-tuned from FAT-MLM substantially improve translation quality.
arXiv Detail & Related papers (2021-02-10T22:53:40Z)
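As a rough illustration of the FAT-MLM idea above, the toy sketch below fuses speech features and partially masked text embeddings into one sequence and trains a shared Transformer encoder to recover the masked tokens. The dimensions, masking rate, and reconstruction target are illustrative assumptions; the actual FAT-MLM objective and architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedMaskedLM(nn.Module):
    """Toy fused acoustic+text masked LM in the spirit of FAT-MLM (assumed design)."""

    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, dim)
        self.mask_emb = nn.Parameter(torch.zeros(dim))   # learned [MASK] vector
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, speech_feats: torch.Tensor, text_ids: torch.Tensor,
                mask_prob: float = 0.15) -> torch.Tensor:
        # speech_feats: (B, T_s, dim) pre-extracted acoustic features
        # text_ids:     (B, T_t) token ids of the paired transcription
        text = self.text_emb(text_ids)
        mask = torch.rand(text_ids.shape, device=text_ids.device) < mask_prob
        text = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(text), text)
        fused = torch.cat([speech_feats, text], dim=1)        # one joint sequence
        hidden = self.encoder(fused)[:, speech_feats.size(1):]  # text positions only
        logits = self.lm_head(hidden)
        if not mask.any():                                    # nothing masked
            return logits.sum() * 0.0
        return F.cross_entropy(logits[mask], text_ids[mask])
```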
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.