CTC-based Non-autoregressive Speech Translation
- URL: http://arxiv.org/abs/2305.17358v1
- Date: Sat, 27 May 2023 03:54:09 GMT
- Title: CTC-based Non-autoregressive Speech Translation
- Authors: Chen Xu, Xiaoqian Liu, Xiaowen Liu, Qingxuan Sun, Yuhao Zhang, Murun
Yang, Qianqian Dong, Tom Ko, Mingxuan Wang, Tong Xiao, Anxiang Ma and Jingbo
Zhu
- Abstract summary: We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$.
- Score: 51.37920141751813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Combining end-to-end speech translation (ST) and non-autoregressive (NAR)
generation is promising in language and speech processing, thanks to their advantages
of less error propagation and low latency. In this paper, we investigate the
potential of connectionist temporal classification (CTC) for non-autoregressive
speech translation (NAST). In particular, we develop a model consisting of two
encoders that are guided by CTC to predict the source and target texts,
respectively. Introducing CTC into NAST on both language sides has obvious
challenges: 1) the conditionally independent generation somewhat breaks the
interdependency among tokens, and 2) the monotonic alignment assumption in
standard CTC does not hold in translation tasks. In response, we develop a
prediction-aware encoding approach and a cross-layer attention approach to
address these issues. We also use curriculum learning to improve convergence of
training. Experiments on the MuST-C ST benchmarks show that our NAST model
achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$, which
is comparable to the autoregressive counterpart and even outperforms the
previous best result by 0.9 BLEU points.
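The abstract describes the dual-encoder design only in prose, so here is a minimal PyTorch sketch of the core idea: one encoder stack supervised by CTC against the source transcript, a second stack on top of it supervised by CTC against the target translation. This is illustrative only, not the authors' released code; it assumes pre-extracted speech features, and it omits the paper's prediction-aware encoding, cross-layer attention, and curriculum learning. All module names and sizes are hypothetical.

```python
# Minimal sketch of CTC-guided dual encoders for NAST (illustrative only).
import torch
import torch.nn as nn

class DualEncoderNAST(nn.Module):
    def __init__(self, d_model=256, src_vocab=1000, tgt_vocab=1000):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.acoustic_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.textual_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.src_head = nn.Linear(d_model, src_vocab)  # predicts source text
        self.tgt_head = nn.Linear(d_model, tgt_vocab)  # predicts target text
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, in_lens, src, src_lens, tgt, tgt_lens):
        h_src = self.acoustic_encoder(feats)   # (B, T, d) source-side states
        h_tgt = self.textual_encoder(h_src)    # target-side states, same length T
        src_lp = self.src_head(h_src).log_softmax(-1).transpose(0, 1)  # (T, B, V)
        tgt_lp = self.tgt_head(h_tgt).log_softmax(-1).transpose(0, 1)
        # CTC on both language sides; the non-monotonic target side is the
        # part that cross-layer attention is meant to compensate for.
        return (self.ctc(src_lp, src, in_lens, src_lens)
                + self.ctc(tgt_lp, tgt, in_lens, tgt_lens))

# Toy usage: batch of 2, 50 feature frames, 10 source / 12 target tokens.
model = DualEncoderNAST()
loss = model(torch.randn(2, 50, 256), torch.tensor([50, 50]),
             torch.randint(1, 1000, (2, 10)), torch.tensor([10, 10]),
             torch.randint(1, 1000, (2, 12)), torch.tensor([12, 12]))
```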
Related papers
- CTC-based Non-autoregressive Textless Speech-to-Speech Translation [38.99922762754443]
Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding.
Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet their translation quality typically lags significantly behind that of autoregressive (AR) models.
In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation.
arXiv Detail & Related papers (2024-06-11T15:00:33Z) - Markovian Transformers for Informative Language Modeling [0.9642500063568188]
Chain-of-Thought (CoT) reasoning holds great promise for explaining the outputs of language models.
Recent studies have highlighted significant challenges in its practical application for interpretability.
We propose a technique to factor next-token prediction through intermediate CoT text, ensuring the CoT is causally load-bearing.
arXiv Detail & Related papers (2024-04-29T17:36:58Z) - Bridging the Gaps of Both Modality and Language: Synchronous Bilingual
CTC for Speech Translation and Speech Recognition [46.41096278421193]
BiL-CTC+ bridges the gap between audio and text as well as between source and target languages.
Our method also yields significant improvements in speech recognition performance.
arXiv Detail & Related papers (2023-09-21T16:28:42Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
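As a sketch of what such a joint objective looks like, the encoder's CTC loss can be interpolated with the attention decoder's cross-entropy. The interpolation weight and shapes below are illustrative, not the paper's exact recipe, and padding handling is omitted for brevity.

```python
# Illustrative joint CTC/attention objective (lam=0.3 is just an example).
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(enc_log_probs, dec_logits, tgt, in_lens, tgt_lens, lam=0.3):
    # enc_log_probs: (T, B, V) log-probabilities from the encoder's CTC head
    # dec_logits:    (B, S, V) logits from the attention decoder
    ctc = F.ctc_loss(enc_log_probs, tgt, in_lens, tgt_lens, blank=0, zero_infinity=True)
    ce = F.cross_entropy(dec_logits.transpose(1, 2), tgt)  # (B, V, S) vs (B, S)
    return lam * ctc + (1.0 - lam) * ce
```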
arXiv Detail & Related papers (2022-10-11T07:13:50Z) - Non-Autoregressive Neural Machine Translation: A Call for Clarity [3.1447111126465]
We take a step back and revisit several techniques that have been proposed for improving non-autoregressive translation models.
We provide novel insights for establishing strong baselines using length prediction or CTC-based architecture variants.
We contribute standardized BLEU, chrF++, and TER scores using sacreBLEU on four translation tasks.
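For reference, the standardized metrics this entry names can be computed with the sacreBLEU Python API roughly as follows (a sketch assuming sacrebleu 2.x; the strings are toy data, not the paper's test sets):

```python
# Toy example of standardized BLEU, chrF++, and TER scoring with sacreBLEU.
import sacrebleu

hyps = ["the cat sat on the mat"]
refs = [["the cat sat on a mat"]]  # one reference stream, one ref per hypothesis

bleu = sacrebleu.corpus_bleu(hyps, refs)
chrf = sacrebleu.corpus_chrf(hyps, refs, word_order=2)  # word_order=2 -> chrF++
ter = sacrebleu.corpus_ter(hyps, refs)
print(f"BLEU={bleu.score:.1f} chrF++={chrf.score:.1f} TER={ter.score:.1f}")
```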
arXiv Detail & Related papers (2022-05-21T12:15:22Z) - Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in
Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z) - Investigating the Reordering Capability in CTC-based Non-Autoregressive
End-to-End Speech Translation [62.943925893616196]
We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC).
CTC's success on translation is counter-intuitive due to its monotonicity assumption, so we analyze its reordering capability.
Our analysis shows that transformer encoders have the ability to change the word order.
arXiv Detail & Related papers (2021-05-11T07:48:45Z) - Orthros: Non-autoregressive End-to-end Speech Translation with
Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter selects a better translation from among candidates of various lengths generated by the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
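A toy sketch of this selection scheme follows; the generate/score callables are stand-ins for the NAR and AR decoders, not the paper's actual interfaces.

```python
# Orthros-style length-beam selection, as a toy: the NAR decoder emits one
# hypothesis per candidate length, and the jointly trained AR decoder
# rescores them; the highest-scoring hypothesis is returned.
from typing import Callable, List

def select_by_length_beam(nar_generate: Callable[[int], str],
                          ar_score: Callable[[str], float],
                          lengths: List[int]) -> str:
    candidates = [nar_generate(n) for n in lengths]  # cheap; parallel in practice
    return max(candidates, key=ar_score)             # single AR rescoring pass

# Dummy usage: three candidate lengths, a scorer that prefers 4-word outputs.
hyps = {8: "wir danken ihnen", 9: "wir danken ihnen sehr", 10: "wir danken euch allen"}
print(select_by_length_beam(lambda n: hyps[n],
                            lambda h: -abs(len(h.split()) - 4), [8, 9, 10]))
```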
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.