Bridging the Gaps of Both Modality and Language: Synchronous Bilingual
CTC for Speech Translation and Speech Recognition
- URL: http://arxiv.org/abs/2309.12234v1
- Date: Thu, 21 Sep 2023 16:28:42 GMT
- Title: Bridging the Gaps of Both Modality and Language: Synchronous Bilingual
CTC for Speech Translation and Speech Recognition
- Authors: Chen Xu, Xiaoqian Liu, Erfeng He, Yuhao Zhang, Qianqian Dong, Tong
Xiao, Jingbo Zhu, Dapeng Man, Wu Yang
- Abstract summary: BiL-CTC+ bridges the gap between audio and text as well as between source and target languages.
Our method also yields significant improvements in speech recognition performance.
- Score: 46.41096278421193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we present synchronous bilingual Connectionist Temporal
Classification (CTC), an innovative framework that leverages dual CTC to bridge
the gaps of both modality and language in the speech translation (ST) task.
Utilizing transcript and translation as concurrent objectives for CTC, our
model bridges the gap between audio and text as well as between source and
target languages. Building upon the recent advances in CTC application, we
develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art
performances on the MuST-C ST benchmarks under resource-constrained scenarios.
Intriguingly, our method also yields significant improvements in speech
recognition performance, revealing the effect of cross-lingual learning on
transcription and demonstrating its broad applicability. The source code is
available at https://github.com/xuchennlp/S2T.
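To make the dual-CTC idea concrete, here is a minimal PyTorch sketch, assuming a shared speech encoder with two projection heads, one scored by CTC against the transcript and one against the translation. All names, shapes, and the loss weighting are illustrative assumptions, not the authors' implementation (see the linked repository for that).
```python
# Minimal sketch of a dual-CTC (bilingual CTC) objective; illustrative only,
# not the authors' code.
import torch
import torch.nn as nn

class BilingualCTC(nn.Module):
    def __init__(self, encoder, d_model, src_vocab_size, tgt_vocab_size, blank=0):
        super().__init__()
        self.encoder = encoder                               # any speech encoder: (B, T, d_model)
        self.src_head = nn.Linear(d_model, src_vocab_size)   # transcript (source-language) head
        self.tgt_head = nn.Linear(d_model, tgt_vocab_size)   # translation (target-language) head
        self.ctc = nn.CTCLoss(blank=blank, zero_infinity=True)

    def forward(self, feats, feat_lens, transcript, transcript_lens,
                translation, translation_lens, alpha=0.5):
        h = self.encoder(feats)                              # (B, T, d_model)
        # nn.CTCLoss expects (T, B, V) log-probabilities
        src_logp = self.src_head(h).log_softmax(dim=-1).transpose(0, 1)
        tgt_logp = self.tgt_head(h).log_softmax(dim=-1).transpose(0, 1)
        # one loss aligns audio with source text, the other with target text
        loss_src = self.ctc(src_logp, transcript, feat_lens, transcript_lens)
        loss_tgt = self.ctc(tgt_logp, translation, feat_lens, translation_lens)
        return alpha * loss_src + (1.0 - alpha) * loss_tgt   # joint bilingual objective
```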
Related papers
- CoT-ST: Enhancing LLM-based Speech Translation with Multimodal Chain-of-Thought [33.32415197728357]
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks.
We introduce a three-stage training framework designed to activate the chain-of-thought capabilities of SLMs.
We propose CoT-ST, a speech translation model that utilizes multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation.
arXiv Detail & Related papers (2024-09-29T01:48:09Z)
- SpeechTaxi: On Multilingual Semantic Speech Classification [0.0]
SpeechTaxi is an 80-hour multilingual dataset for semantic speech classification of Bible verses.
Multilingual speech encoders (MSEs) seem to have poor cross-lingual transfer abilities, with end-to-end (E2E) classification substantially lagging cascading (CA, i.e., transcribe-then-classify) both in (1) zero-shot transfer to languages unseen in training and (2) multilingual training, i.e., joint training on multiple languages.
We devise a novel CA approach based on transcription to Romanized text as a language-agnostic intermediate representation and show that it represents a robust solution for languages without native ASR support.
arXiv Detail & Related papers (2024-09-10T09:56:15Z)
- Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech [6.243356997302935]
We introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model.
In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker.
In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model.
arXiv Detail & Related papers (2023-09-15T09:03:14Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
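For intuition about where the speed-up comes from: a CTC model scores all encoder frames in parallel, and a non-autoregressive hypothesis can be read off with the standard CTC collapse rule (merge repeated tokens, drop blanks) instead of generating token by token. A minimal illustrative sketch, not the paper's code:
```python
# Greedy CTC decoding: collapse a per-frame argmax sequence into output tokens.
def ctc_greedy_decode(frame_token_ids, blank=0):
    """Merge consecutive repeats, then drop blanks; one pass over encoder frames."""
    out, prev = [], None
    for t in frame_token_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

# e.g. [5, 5, 0, 5, 7, 7, 0] -> [5, 5, 7]  (a repeat after a blank is kept)
assert ctc_greedy_decode([5, 5, 0, 5, 7, 7, 0]) == [5, 5, 7]
```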
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
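A minimal sketch of what concatenation-style augmentation of this kind looks like; the function and variable names are hypothetical, not the paper's implementation:
```python
# Hypothetical code-switching augmentation by splicing two monolingual examples.
import numpy as np

def make_code_switched_example(audio_a, text_a, audio_b, text_b):
    """Splice two monolingual utterances (same sample rate) into one
    synthetic code-switched training example."""
    audio = np.concatenate([audio_a, audio_b])  # back-to-back waveforms
    text = f"{text_a} {text_b}"                 # labels joined in spoken order
    return audio, text
```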
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
- CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
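For context, the joint CTC/attention framework from hybrid ASR (which this work carries over to translation) typically trains with an interpolated loss, where the weight λ ∈ [0, 1] controls how strongly CTC's monotonic alignment constrains the attention model:
```latex
\mathcal{L}_{\text{joint}} = \lambda \, \mathcal{L}_{\text{CTC}} + (1 - \lambda) \, \mathcal{L}_{\text{attention}}
```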
arXiv Detail & Related papers (2022-10-11T07:13:50Z)
- End-to-End Speech Translation for Code Switched Speech [13.97982457879585]
Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages.
We focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation.
We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.
arXiv Detail & Related papers (2022-04-11T13:25:30Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.