Consecutive Decoding for Speech-to-text Translation
- URL: http://arxiv.org/abs/2009.09737v4
- Date: Fri, 15 Apr 2022 03:50:36 GMT
- Title: Consecutive Decoding for Speech-to-text Translation
- Authors: Qianqian Dong, Mingxuan Wang, Hao Zhou, Shuang Xu, Bo Xu, Lei Li
- Abstract summary: COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
- Score: 51.155661276936044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-to-text translation (ST), which directly translates the source
language speech to the target language text, has attracted intensive attention
recently. However, the combination of speech recognition and machine
translation in a single model poses a heavy burden on the direct cross-modal
cross-lingual mapping. To reduce the learning difficulty, we propose
COnSecutive Transcription and Translation (COSTT), an integral approach for
speech-to-text translation. The key idea is to generate source transcript and
target translation text with a single decoder. This benefits model training,
since additional large-scale parallel text corpora can be fully exploited to
enhance speech translation training. Our method is verified on three mainstream
datasets, including Augmented LibriSpeech English-French dataset, IWSLT2018
English-German dataset, and TED English-Chinese dataset. Experiments show that
our proposed COSTT outperforms or is on par with the previous state-of-the-art
methods on the three datasets. We have released our code at
\url{https://github.com/dqqcasia/st}.
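To make the consecutive decoding idea concrete, here is a minimal sketch of how a single-decoder target could be assembled: transcript tokens first, then translation tokens. The special tokens and IDs are illustrative assumptions; the authors' actual implementation is in the repository linked above.
```python
# Sketch: building a consecutive decoding target for COSTT-style training.
# BOS/SEP/EOS are hypothetical special tokens; the released code at
# https://github.com/dqqcasia/st defines its own vocabulary and markers.
BOS, SEP, EOS = 0, 1, 2

def build_consecutive_target(transcript_ids, translation_ids):
    """One decoder target: source transcript first, then target translation,
    so a single decoder first transcribes and then translates."""
    return [BOS] + transcript_ids + [SEP] + translation_ids + [EOS]

# English speech -> English transcript -> French translation (toy IDs).
transcript = [101, 102, 103]   # e.g., "thank you very much"
translation = [201, 202]       # e.g., "merci beaucoup"
print(build_consecutive_target(transcript, translation))
# [0, 101, 102, 103, 1, 201, 202, 2]
# The same target format also fits text-only transcript/translation pairs,
# which is how large parallel text corpora can aid speech translation training.
```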
Related papers
- Translatotron 3: Speech to Speech Translation with Monolingual Data [23.376969078371282]
Translatotron 3 is a novel approach to unsupervised direct speech-to-speech translation from monolingual speech-text datasets.
Results show that Translatotron 3 outperforms a baseline cascade system.
arXiv Detail & Related papers (2023-05-27T18:30:54Z)
- Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data [38.816953592085156]
We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems.
It enhances the model's ability to adapt one modality (i.e., source-language speech) to another (i.e., source-language text).
arXiv Detail & Related papers (2022-12-04T09:27:56Z)
- Speech-to-Speech Translation For A Real-world Unwritten Language [62.414304258701804]
We study speech-to-speech translation (S2ST) that translates speech from one language into another language.
We present an end-to-end solution from training data collection, modeling choices to benchmark dataset release.
arXiv Detail & Related papers (2022-11-11T20:21:38Z) - Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
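As a rough illustration of the shared discrete vocabulary idea, the sketch below quantizes both speech and text encoder states against one codebook by nearest-neighbor lookup. Codebook size, feature dimension, and the lookup rule are assumptions for illustration, not DCMA's exact formulation.
```python
import torch

# Sketch: discrete cross-modal alignment via one shared codebook.
# All sizes and the nearest-neighbor rule are illustrative assumptions.
torch.manual_seed(0)
codebook = torch.randn(512, 256)  # 512 shared discrete codes, dim 256

def quantize(states: torch.Tensor) -> torch.Tensor:
    """Map continuous encoder states (T, 256) to shared code indices (T,)."""
    dists = torch.cdist(states, codebook)  # (T, 512) pairwise distances
    return dists.argmin(dim=-1)            # nearest code per frame / token

speech_codes = quantize(torch.randn(80, 256))  # speech encoder output
text_codes = quantize(torch.randn(12, 256))    # text encoder output
# Training would push paired speech and text toward the same code sequence,
# so the decoder conditions on one modality-agnostic discrete space.
```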
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features desired for speech-to-text translation.
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
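For intuition on length shrinking, here is a minimal sketch of an adapter that downsamples a long speech representation toward text-like length. M-Adapter itself is Transformer-based; the strided convolutions here are only an illustrative stand-in.
```python
import torch
import torch.nn as nn

# Sketch: an adapter that shrinks a speech representation along the time
# axis. Each stride-2 convolution halves the sequence (4x total).
class ShrinkingAdapter(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU(),
        )

    def forward(self, speech: torch.Tensor) -> torch.Tensor:
        # speech: (batch, time, dim) -> (batch, time // 4, dim)
        return self.down(speech.transpose(1, 2)).transpose(1, 2)

adapter = ShrinkingAdapter()
out = adapter(torch.randn(2, 100, 256))  # -> shape (2, 25, 256)
```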
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- T-Modules: Translation Modules for Zero-Shot Cross-Modal Machine Translation [19.332953510406327]
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks.
Multilingual speech and text are encoded in a joint fixed-size representation space.
We compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities.
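A minimal sketch of the fixed-size representation idea, assuming simple mean pooling; T-Modules' actual encoders and training objectives differ.
```python
import torch

# Sketch: pooling variable-length speech or text encoder states into one
# fixed-size vector, so decoders for any language or modality can be
# trained against the same space. Mean pooling is an assumption here.
def to_fixed_size(states: torch.Tensor) -> torch.Tensor:
    """(time, dim) encoder output -> (dim,) fixed-size representation."""
    return states.mean(dim=0)

speech_repr = to_fixed_size(torch.randn(80, 512))  # speech encoder states
text_repr = to_fixed_size(torch.randn(12, 512))    # text encoder states
# Zero-shot transfer: a decoder trained to decode from text_repr can be
# fed speech_repr at test time, since both live in the same space.
assert speech_repr.shape == text_repr.shape == (512,)
```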
arXiv Detail & Related papers (2022-05-24T17:23:35Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) model takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel corpora available.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all generated summaries) and is not responsible for any consequences.