CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation
- URL: http://arxiv.org/abs/2410.05146v1
- Date: Mon, 7 Oct 2024 15:58:03 GMT
- Title: CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation
- Authors: Rui Zhao, Jinyu Li, Ruchao Fan, Matt Post,
- Abstract summary: We introduce a methodology named Connectionist Temporal Classification guided modality matching ( CTC-GMM)
This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence.
Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can increase translation accuracy relatively by 13.9% and 6.4% respectively.
- Score: 36.417792361080615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Models for streaming speech translation (ST) can achieve high accuracy and low latency if they're developed with vast amounts of paired audio in the source language and written text in the target language. Yet, these text labels for the target language are often pseudo labels due to the prohibitive cost of manual ST data labeling. In this paper, we introduce a methodology named Connectionist Temporal Classification guided modality matching (CTC-GMM) that enhances the streaming ST model by leveraging extensive machine translation (MT) text data. This technique employs CTC to compress the speech sequence into a compact embedding sequence that matches the corresponding text sequence, allowing us to utilize matched {source-target} language text pairs from the MT corpora to refine the streaming ST model further. Our evaluations with FLEURS and CoVoST2 show that the CTC-GMM approach can increase translation accuracy relatively by 13.9% and 6.4% respectively, while also boosting decoding speed by 59.7% on GPU.
Related papers
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$times$.
arXiv Detail & Related papers (2023-05-27T03:54:09Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of an industry commonly-used streaming model, Transformer-Transducer (T-T)
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - Efficient CTC Regularization via Coarse Labels for End-to-End Speech
Translation [48.203394370942505]
We re-examine the need for genuine vocabulary labels for Connectionist Temporal Classification (CTC) for regularization.
We propose coarse labeling for CTC, which merges vocabulary labels via simple rules, such as using truncation, division or modulo (MOD) operations.
We show that CoLaCTC successfully generalizes to CTC regularization regardless of using transcript or translation for labeling.
arXiv Detail & Related papers (2023-02-21T18:38:41Z) - Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z) - CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z) - AlloST: Low-resource Speech Translation without Source Transcription [17.53382405899421]
We propose a learning framework that utilizes a language-independent universal phone recognizer.
The framework is based on an attention-based sequence-to-sequence model.
Experiments conducted on the Fisher Spanish-English and Taigi-Mandarin drama corpora show that our method outperforms the conformer-based baseline.
arXiv Detail & Related papers (2021-05-01T05:30:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.