Direct Simultaneous Translation Activation for Large Audio-Language Models
- URL: http://arxiv.org/abs/2509.15692v1
- Date: Fri, 19 Sep 2025 07:12:18 GMT
- Title: Direct Simultaneous Translation Activation for Large Audio-Language Models
- Authors: Pei Zhang, Yiming Wang, Jialong Tang, Baosong Yang, Rui Wang, Derek F. Wong, Fei Huang
- Abstract summary: Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time. We introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data.
- Score: 58.03785696031301
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Simultaneous speech-to-text translation (Simul-S2TT) aims to translate speech into target text in real time, outputting translations while receiving source speech input, rather than waiting for the entire utterance to be spoken. Simul-S2TT research often modifies model architectures to implement read-write strategies. However, with the rise of large audio-language models (LALMs), a key challenge is how to directly activate Simul-S2TT capabilities in base models without additional architectural changes. In this paper, we introduce Simultaneous Self-Augmentation (SimulSA), a strategy that utilizes LALMs' inherent capabilities to obtain simultaneous data by randomly truncating speech and constructing partially aligned translations. By incorporating them into offline SFT data, SimulSA effectively bridges the distribution gap between offline translation during pretraining and simultaneous translation during inference. Experimental results demonstrate that augmenting only about 1% of the simultaneous data, compared to the full offline SFT data, can significantly activate LALMs' Simul-S2TT capabilities without modifications to model architecture or decoding strategy.
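The augmentation described in the abstract (randomly truncate the speech, keep only the translation prefix aligned with it, and mix roughly 1% of such samples into the offline SFT data) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the word-count proxy for partial alignment, and the truncation-ratio range are all assumptions.

```python
import random

def simulsa_augment(sample, truncate_ratio_range=(0.3, 0.9)):
    """Hypothetical SimulSA-style augmentation: truncate the speech to a
    random prefix and keep only the translation prefix covering roughly
    the same fraction of the target (a word-level proxy for the paper's
    partial alignment)."""
    ratio = random.uniform(*truncate_ratio_range)
    n_frames = int(len(sample["speech"]) * ratio)
    partial_speech = sample["speech"][:n_frames]
    words = sample["translation"].split()
    partial_translation = " ".join(words[: max(1, int(len(words) * ratio))])
    return {"speech": partial_speech, "translation": partial_translation}

def mix_into_sft(offline_data, augment_fraction=0.01):
    """Augment about 1% of the offline SFT data, as in the abstract,
    and append the simultaneous samples to the offline corpus."""
    k = max(1, int(len(offline_data) * augment_fraction))
    augmented = [simulsa_augment(s) for s in random.sample(offline_data, k)]
    return offline_data + augmented
```

The point of the sketch is that no architectural change is involved: the simultaneous samples are ordinary (speech, text) pairs that simply look like mid-utterance states, so standard SFT exposes the model to the inference-time distribution.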
Related papers
- Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting? [13.202203902821333]
We systematically compare Chain-of-Thought (CoT) and Direct prompting under increasing amounts of Speech-to-Text Translation (S2TT) data. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.
arXiv Detail & Related papers (2025-10-03T15:23:32Z)
- SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation [14.57248739077317]
This paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder.
arXiv Detail & Related papers (2025-04-22T01:05:32Z)
- Efficient and Adaptive Simultaneous Speech Translation with Fully Unidirectional Architecture [14.056534007451763]
Simultaneous speech translation (SimulST) produces translations incrementally while processing partial speech input. Existing LLM-based SimulST approaches incur significant computational overhead due to repeated encoding by the bidirectional speech encoder. We introduce Efficient and Adaptive Simultaneous Speech Translation (EASiST) with a fully unidirectional architecture.
arXiv Detail & Related papers (2025-04-16T06:46:15Z)
- Tagged End-to-End Simultaneous Speech Translation Training using Simultaneous Interpretation Data [16.05089716626287]
We propose an effective way to train a SimulST model using a mix of simultaneous interpretation (SI) and offline data.
Experimental results show BLEURT improvements across different latency ranges.
arXiv Detail & Related papers (2023-06-14T15:42:06Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Anticipation-free Training for Simultaneous Translation [70.85761141178597]
Simultaneous translation (SimulMT) speeds up the translation process by starting to translate before the source sentence is completely available.
Existing methods increase latency or introduce adaptive read-write policies for SimulMT models to handle local reordering and improve translation quality.
We propose a new framework that decomposes the translation process into a monotonic translation step and a reordering step.
arXiv Detail & Related papers (2022-01-30T16:29:37Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the system with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read the entire video before starting the translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores exceeding those of the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z)
- RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer [33.876412404781846]
RealTranS is an end-to-end model for simultaneous speech translation.
It maps speech features into text space with a weighted-shrinking operation and a semantic encoder.
Experiments show that RealTranS with the Wait-K-Stride-N strategy outperforms prior end-to-end models.
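The Wait-K-Stride-N strategy mentioned above can be sketched as a read/write schedule: read K source chunks up front, then alternate between writing a stride of N target tokens and reading one more chunk. This is a minimal illustration under those assumptions, not RealTranS's actual decoding code.

```python
def wait_k_stride_n_schedule(k, n, n_source, n_target):
    """Hypothetical Wait-K-Stride-N read/write schedule: read k source
    chunks first, then emit n target tokens per additional chunk read,
    stopping reads once the source is exhausted."""
    actions = []
    read, written = 0, 0
    while written < n_target:
        # Read until we are k chunks ahead (one extra read per completed
        # stride of n writes), or until the source runs out.
        needed = min(n_source, k + written // n)
        while read < needed:
            actions.append("READ")
            read += 1
        # Write a stride of up to n target tokens.
        for _ in range(min(n, n_target - written)):
            actions.append("WRITE")
            written += 1
    return actions
```

For example, with k=2, n=2, 4 source chunks, and 6 target tokens, the schedule interleaves reads and writes after the initial 2-chunk wait rather than waiting for the full utterance.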
arXiv Detail & Related papers (2021-06-09T06:35:46Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.