Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data
- URL: http://arxiv.org/abs/2212.01778v1
- Date: Sun, 4 Dec 2022 09:27:56 GMT
- Title: Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data
- Authors: Yuhao Zhang, Chen Xu, Bojie Hu, Chunliang Zhang, Tong Xiao, Jingbo Zhu
- Abstract summary: We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems.
It enhances the ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text).
- Score: 38.816953592085156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method for introducing a text encoder into pre-trained
end-to-end speech translation systems. It enhances the ability to adapt one
modality (i.e., source-language speech) to another (i.e., source-language
text). Thus, the speech translation model can learn from both unlabeled and
labeled data, especially when the source-language text data is abundant. Beyond
this, we present a denoising method to build a robust text encoder that can
deal with both normal and noisy text data. Our system sets new
state-of-the-art results on the MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks.
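The abstract describes the idea at a high level; the following is a minimal, hypothetical sketch of that kind of setup (not the authors' implementation): a speech encoder and a source-text encoder feed one shared translation decoder, so training can draw on speech-translation pairs and on abundant source-language text, and feeding the text branch corrupted tokens stands in for the denoising objective. All module names, dimensions, and the simple summed loss are assumptions.

```python
# Hypothetical sketch (not the paper's code): speech and source-text encoders
# sharing one translation decoder. Names, sizes, and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, N_MEL, PAD = 1000, 256, 80, 0

class SharedDecoderST(nn.Module):
    def __init__(self):
        super().__init__()
        # Speech branch: subsample filterbank frames, then Transformer layers.
        self.subsample = nn.Conv1d(N_MEL, D_MODEL, kernel_size=3, stride=2)
        self.speech_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=2)
        # Text branch: encodes source-language tokens (clean or corrupted),
        # which is how text-only data can reach the shared decoder.
        self.src_embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=2)
        # Shared decoder over target-language tokens.
        self.tgt_embed = nn.Embedding(VOCAB, D_MODEL, padding_idx=PAD)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True), num_layers=2)
        self.out_proj = nn.Linear(D_MODEL, VOCAB)

    def encode_speech(self, fbank):            # fbank: (B, T, N_MEL)
        x = self.subsample(fbank.transpose(1, 2)).transpose(1, 2)
        return self.speech_enc(x)              # (B, T', D_MODEL)

    def encode_text(self, src_tokens):         # src_tokens: (B, S)
        return self.text_enc(self.src_embed(src_tokens))

    def decode(self, memory, tgt_in):          # tgt_in: (B, U)
        causal = torch.triu(
            torch.full((tgt_in.size(1), tgt_in.size(1)), float("-inf")), diagonal=1)
        h = self.decoder(self.tgt_embed(tgt_in), memory, tgt_mask=causal)
        return self.out_proj(h)                # (B, U, VOCAB)

def xent(logits, tgt_out):
    return F.cross_entropy(logits.transpose(1, 2), tgt_out, ignore_index=PAD)

# Toy batch: one speech-translation pair and one (noisy) text-translation pair.
model = SharedDecoderST()
fbank = torch.randn(2, 200, N_MEL)
noisy_src = torch.randint(1, VOCAB, (2, 20))
tgt_in, tgt_out = torch.randint(1, VOCAB, (2, 15)), torch.randint(1, VOCAB, (2, 15))

loss = xent(model.decode(model.encode_speech(fbank), tgt_in), tgt_out) \
     + xent(model.decode(model.encode_text(noisy_src), tgt_in), tgt_out)
loss.backward()
```

In the actual paper the text encoder is introduced into an already pre-trained end-to-end ST system and a dedicated denoising method makes it robust to noisy text; the toy code above only illustrates why a shared decoder lets source-language text data contribute to the translation objective.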
Related papers
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
By shrinking the speech sequence, M-Adapter produces features suited to speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- Sample, Translate, Recombine: Leveraging Audio Alignments for Data Augmentation in End-to-end Speech Translation [14.839931533868176]
We present a novel approach to data augmentation that leverages audio alignments, linguistic properties, and translation.
Our method delivers consistent improvements of up to 0.9 and 1.1 BLEU points on five language pairs on CoVoST 2 and on two language pairs on Europarl-ST, respectively.
arXiv Detail & Related papers (2022-03-16T17:15:46Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate the source transcript and the target translation text with a single decoder; a toy sketch of this target layout appears after this list.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows extension to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
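To make the consecutive-decoding idea from the COSTT entry above concrete, here is a toy, hypothetical sketch of how a single decoder's target sequence can be laid out as transcript, separator, then translation. The token ids and special symbols are made up for illustration and are not taken from the COSTT implementation.

```python
# Toy sketch (not the COSTT code): lay out one decoder target as
# <bos> transcript <sep> translation <eos>, so a single autoregressive
# decoder first emits the source transcript, then the target translation.
# All token ids and special symbols below are illustrative assumptions.
BOS, SEP, EOS = 1, 2, 3

def build_consecutive_target(transcript_ids, translation_ids):
    """Concatenate transcript and translation into one decoder target."""
    full = [BOS] + transcript_ids + [SEP] + translation_ids + [EOS]
    # Standard teacher-forcing shift: the decoder reads full[:-1] and
    # is trained to predict full[1:].
    return full[:-1], full[1:]

# Fake ids standing in for a source transcript and its translation.
dec_in, dec_out = build_consecutive_target([10, 11, 12], [20, 21, 22])
print(dec_in)   # [1, 10, 11, 12, 2, 20, 21, 22]
print(dec_out)  # [10, 11, 12, 2, 20, 21, 22, 3]
```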