Decision Attentive Regularization to Improve Simultaneous Speech
Translation Systems
- URL: http://arxiv.org/abs/2110.15729v1
- Date: Wed, 13 Oct 2021 08:33:31 GMT
- Title: Decision Attentive Regularization to Improve Simultaneous Speech
Translation Systems
- Authors: Mohd Abbas Zaidi, Beomseok Lee, Nikhil Kumar Lakumarapu, Sangha Kim,
Chanwoo Kim
- Abstract summary: Simultaneous Speech-to-text Translation (SimulST) systems translate source speech in tandem with the speaker using partial input.
Recent works have tried to leverage the text translation task to improve the performance of Speech Translation (ST) in the offline domain.
Motivated by these improvements, we propose to add Decision Attentive Regularization (DAR) to Monotonic Multihead Attention (MMA) based SimulST systems.
- Score: 12.152208198444182
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Simultaneous Speech-to-text Translation (SimulST) systems translate source
speech in tandem with the speaker using partial input. Recent works have tried
to leverage the text translation task to improve the performance of Speech
Translation (ST) in the offline domain. Motivated by these improvements, we
propose to add Decision Attentive Regularization (DAR) to Monotonic Multihead
Attention (MMA) based SimulST systems. DAR improves the read/write decisions
for speech using the Simultaneous text Translation (SimulMT) task. We also
extend several techniques from the offline domain to the SimulST task. Our
proposed system achieves significant performance improvements for the MuST-C
English-German (EnDe) SimulST task, where we provide an average BLEU score
improvement of around 4.57 points, or 34.17%, across different latencies.
Further, the latency-quality tradeoffs establish that the proposed model
achieves better results compared to the baseline.
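The read/write decisions that DAR regularizes can be illustrated with the simplest fixed simultaneous policy, wait-k: read k source units before writing anything, then alternate one write per additional read. This is a minimal sketch for intuition only (the function and its parameters are illustrative, not the paper's MMA-based adaptive policy):

```python
def wait_k_actions(src_len, tgt_len, k):
    """Return the READ/WRITE action sequence of a wait-k policy.

    Reads k source units up front, then alternates WRITE and READ;
    once the source is exhausted, the remaining target tokens are
    written unconditionally.
    """
    actions = []
    read = written = 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

# With 4 source units, 4 target tokens, and k=2, the policy waits for
# two reads, then interleaves writes with the remaining reads.
print(wait_k_actions(4, 4, 2))
```

Adaptive policies such as MMA replace this fixed schedule with a learned per-head decision at every step, which is what DAR's regularization targets.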
Related papers
- Simultaneous Translation with Offline Speech and LLM Models in CUNI Submission to IWSLT 2025 [0.0]
This paper describes the Charles University submission to the Simultaneous Speech Translation Task of the IWSLT 2025. We cover all four language pairs with a direct or cascade approach. The backbone of our systems is the offline Whisper speech model, which we use for both translation and transcription in simultaneous mode with the state-of-the-art simultaneous policy AlignAtt.
arXiv Detail & Related papers (2025-06-20T15:27:44Z) - KIT's Offline Speech Translation and Instruction Following Submission for IWSLT 2025 [56.61209412965054]
We present the Karlsruhe Institute of Technology's submissions for the Offline ST and Instruction Following (IF) tracks. We propose a pipeline that employs multiple automatic speech recognition systems, whose outputs are fused using an LLM with document-level context. For the IF track, we develop an end-to-end model that integrates a speech encoder with an LLM to perform a wide range of instruction-following tasks.
arXiv Detail & Related papers (2025-05-19T12:21:29Z) - Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture [14.056534007451763]
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input.
Existing LLM-based SimulST approaches incur significant computational overhead due to repeated encoding by the bidirectional speech encoder.
We introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture.
arXiv Detail & Related papers (2025-04-16T06:46:15Z) - Speech Translation Refinement using Large Language Models [8.602429274223693]
This paper investigates how large language models (LLMs) can improve the performance of speech translation by introducing a joint refinement process.
Through the joint refinement of speech translation (ST) and automatic speech recognition (ASR) transcription via LLMs, the performance of the ST model is significantly improved.
Experimental results on the MuST-C and CoVoST 2 datasets, which include seven translation tasks, demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2025-01-25T05:32:42Z) - FASST: Fast LLM-based Simultaneous Speech Translation [9.65638081954595]
Simultaneous speech translation (SST) takes streaming speech input and generates text translation on the fly.
We propose FASST, a fast large language model based method for streaming speech translation.
Experiment results show that FASST achieves the best quality-latency trade-off.
arXiv Detail & Related papers (2024-08-18T10:12:39Z) - An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - Rethinking and Improving Multi-task Learning for End-to-end Speech
Translation [51.713683037303035]
We investigate the consistency between different tasks, considering different times and modules.
We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations.
We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.
arXiv Detail & Related papers (2023-11-07T08:48:46Z) - DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z) - SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z) - The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline
Shared Task [92.5087402621697]
This paper describes the submission of our end-to-end YiTrans speech translation system for the IWSLT 2022 offline task.
The YiTrans system is built on large-scale pre-trained encoder-decoder models.
Our final submissions rank first on English-German and English-Chinese end-to-end systems in terms of the automatic evaluation metric.
arXiv Detail & Related papers (2022-06-12T16:13:01Z) - Exploring Continuous Integrate-and-Fire for Adaptive Simultaneous Speech
Translation [75.86581380817464]
A SimulST system generally includes two components: the pre-decision that aggregates the speech information and the policy that decides to read or write.
This paper proposes to model the adaptive policy by adapting the Continuous Integrate-and-Fire (CIF)
Compared with monotonic multihead attention (MMA), our method has the advantage of simpler computation, superior quality at low latency, and better generalization to long utterances.
arXiv Detail & Related papers (2022-03-22T23:33:18Z) - STEMM: Self-learning with Speech-text Manifold Mixup for Speech
Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate such discrepancy.
Experiments on MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
arXiv Detail & Related papers (2022-03-20T01:49:53Z) - FST: the FAIR Speech Translation System for the IWSLT21 Multilingual
Shared Task [36.51221186190272]
We describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign.
Our system is built by leveraging transfer learning across modalities, tasks and languages.
arXiv Detail & Related papers (2021-07-14T19:43:44Z) - SimulMT to SimulST: Adapting Simultaneous Text Translation to End-to-End
Simultaneous Speech Translation [23.685648804345984]
Simultaneous text translation and end-to-end speech translation have recently made great progress but little work has combined these tasks together.
We investigate how to adapt simultaneous text translation methods such as wait-k and monotonic multihead attention to end-to-end simultaneous speech translation by introducing a pre-decision module.
A detailed analysis is provided on the latency-quality trade-offs of combining fixed and flexible pre-decision with fixed and flexible policies.
arXiv Detail & Related papers (2020-11-03T22:47:58Z)
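Several of the papers above rely on a pre-decision module that aggregates raw speech frames before the read/write policy fires. A fixed pre-decision, the simplest variant, can be sketched as mean-pooling every fixed-size window of frame features into one unit, so the policy decides per chunk rather than per 10 ms frame (a hypothetical minimal sketch; the chunk size and pooling choice here are illustrative):

```python
def fixed_pre_decision(frame_features, chunk_size):
    """Mean-pool every `chunk_size` consecutive frame vectors into one
    pooled vector; a simultaneous policy then makes one read/write
    decision per pooled unit instead of per raw frame."""
    pooled = []
    for i in range(0, len(frame_features), chunk_size):
        chunk = frame_features[i:i + chunk_size]
        dim = len(chunk[0])
        pooled.append(
            [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        )
    return pooled

# Four 1-dimensional frame features pooled in chunks of two.
print(fixed_pre_decision([[1.0], [3.0], [5.0], [7.0]], 2))
```

Flexible pre-decisions instead learn where to segment the speech, which is one axis of the latency-quality trade-off analyzed in the SimulMT-to-SimulST paper above.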