Related papers: MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction

URL: http://arxiv.org/abs/2510.10003v1
Date: Sat, 11 Oct 2025 04:06:20 GMT
Title: MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction
Authors: Jianjin Wang, Runsong Zhao, Xiaoqian Liu, Yuan Ge, Ziqiang Xu, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu
Abstract summary: We introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models. We show that all MTP loss variants consistently improve the quality of S2UT translation.
Score: 49.92201266421949
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is not dense in semantics, so we generally need multiple tokens to express a complete semantic unit. To address this limitation, we introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling models to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and enhancing information density per position. Initial MTP implementations apply the loss at the final layer, which improves output representation but initiates information enrichment too late. We hypothesize that advancing the information enrichment process to intermediate layers can achieve earlier and more effective enhancement of hidden representation. Consequently, we propose MTP-S2UT loss, applying MTP loss to hidden representation where CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.
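As a rough illustration of the multi-token prediction idea in the abstract, the following minimal sketch averages a cross-entropy term over several future tokens at each position. It is not the authors' implementation: the function names, the one-head-per-offset layout, the uniform averaging, and the toy inputs are all assumptions made for illustration, and it does not model the paper's placement of the loss at the layer where the CTC loss is computed.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of vocabulary scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def mtp_loss(head_logits, units):
    """Average cross-entropy over several future tokens per position.

    head_logits[h][t] holds the vocabulary scores that (hypothetical) head h
    emits at position t; it is scored against units[t + h + 1], so head 0 is
    ordinary next-token prediction and head 1 looks two tokens ahead.
    """
    total, count = 0.0, 0
    for h, per_position in enumerate(head_logits):
        offset = h + 1
        for t, scores in enumerate(per_position):
            target_index = t + offset
            if target_index >= len(units):
                break  # no future token left for this head at this position
            total -= log_softmax(scores)[units[target_index]]
            count += 1
    return total / count if count else 0.0


# Toy example (assumed values): 3 unit types, 3 positions, 2 heads.
units = [2, 0, 1]
uniform = [[[0.0, 0.0, 0.0] for _ in range(3)] for _ in range(2)]
print(mtp_loss(uniform, units))  # equals log(3) up to floating-point rounding
```

With uniform scores every valid (head, position) pair contributes the same term, so the loss reduces to the log of the vocabulary size; sharper scores on the correct future units lower it.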

Related papers

FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction [11.691960175716163]
This paper introduces FastMTP, a method that improves multi-step draft quality by aligning MTP training with its inference pattern. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction.
arXiv Detail & Related papers (2025-09-16T07:36:26Z)
Entropy-based Coarse and Compressed Semantic Speech Representation Learning [72.18542411704347]
We propose an entropy-based dynamic aggregation framework for learning compressed semantic speech representations. Experiments on ASR, speech-to-text translation, and voice conversion tasks demonstrate that the compressed representations perform on par with or better than dense token sequences.
arXiv Detail & Related papers (2025-08-30T13:50:58Z)
What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation. We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling. We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
arXiv Detail & Related papers (2025-06-14T15:26:31Z)
L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [95.53699156138435]
We propose leap multi-token prediction (L-MTP), an innovative token prediction method. Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z)
GOAT-TTS: Expressive and Realistic Speech Generation via A Dual-Branch LLM [42.93855899824886]
We propose a text-to-speech generation approach optimized via a novel dual-branch ArchiTecture (GOAT-TTS). GOAT-TTS combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency. Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models.
arXiv Detail & Related papers (2025-04-15T01:44:56Z)
Pushing the Limits of Zero-shot End-to-End Speech Translation [15.725310520335785]
Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems. We introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data. Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method's superiority.
arXiv Detail & Related papers (2024-02-16T03:06:37Z)
Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing [72.56219471145232]
We propose a ST/MT multi-tasking framework with hard parameter sharing. Our method reduces the speech-text modality gap via a pre-processing stage. We show that our framework improves attentional encoder-decoder, Connectionist Temporal Classification (CTC), transducer, and joint CTC/attention models by an average of +0.5 BLEU.
arXiv Detail & Related papers (2023-09-27T17:48:14Z)
TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation. We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices. TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation [37.51435498386953]
We propose the Speech-TExt Manifold Mixup (STEMM) method to calibrate the representation discrepancy between the speech and text modalities. Experiments on the MuST-C speech translation benchmark and further analysis show that our method effectively alleviates the cross-modal representation discrepancy.
arXiv Detail & Related papers (2022-03-20T01:49:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.