Self-Speculative Biased Decoding for Faster Live Translation
- URL: http://arxiv.org/abs/2509.21740v1
- Date: Fri, 26 Sep 2025 01:13:37 GMT
- Title: Self-Speculative Biased Decoding for Faster Live Translation
- Authors: Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang
- Abstract summary: Self-Speculative Biased Decoding is a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We show that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have recently demonstrated impressive capabilities in various text generation tasks. However, it remains challenging to use them off-the-shelf in streaming applications (such as live translation), where the output must continually update as the input context expands, while still maintaining a reasonable computational cost to meet latency requirements. In this work, we reexamine the re-translation approach to simultaneous translation and propose Self-Speculative Biased Decoding, a novel inference paradigm designed to avoid repeatedly generating output from scratch for a consistently growing input stream. We propose using the most recent output as a draft for the current growing input context. During the verification stage, the output is biased towards the draft tokens for a higher draft acceptance rate. This strategy not only minimizes flickering that might distract users but also leads to higher speedups. After draft verification, conventional decoding takes over from the point of divergence and continues until the end condition is met. Unlike existing speculative decoding strategies, our approach eliminates the need for draft computations, making it a model-agnostic and plug-and-play solution for accelerating latency-sensitive streaming applications. Experimental results on simultaneous text-to-text re-translation demonstrate that our approach achieves up to 1.7x speedup compared to conventional auto-regressive re-translation without compromising quality. Additionally, it reduces flickering by 80% when combined with the display-only mask-k technique.
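The verification stage described in the abstract can be sketched as a simple greedy loop: reuse the previous output as the draft, add a logit bonus toward each draft token, and accept tokens until the biased argmax diverges. This is a minimal illustrative sketch, not the paper's implementation; the function names, the dict-based score representation, and the `bias` value are all assumptions.

```python
def biased_verify(draft_tokens, scores_per_pos, bias=2.0):
    """Greedy verification sketch: accept draft tokens while the biased
    argmax agrees with the draft; stop at the first divergence.

    draft_tokens:   the previous translation output, reused as the draft
    scores_per_pos: list of {token: logit} dicts, one per draft position,
                    from a single forward pass over the grown input context
    bias:           additive logit bonus toward each draft token
    """
    accepted = []
    for draft_tok, scores in zip(draft_tokens, scores_per_pos):
        biased = dict(scores)
        # Bias the draft token upward to raise the acceptance rate.
        biased[draft_tok] = biased.get(draft_tok, 0.0) + bias
        if max(biased, key=biased.get) != draft_tok:
            break  # point of divergence: conventional decoding resumes here
        accepted.append(draft_tok)
    return accepted
```

The accepted prefix is displayed unchanged (which is what suppresses flickering), and ordinary autoregressive decoding continues from the divergence point onward.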
Related papers
- Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding. We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z) - Context-Aware Initialization for Reducing Generative Path Length in Diffusion Language Models [0.0]
Diffusion Large Language Models (DLLMs) enable fully parallel token decoding but often remain impractical at inference time. Most existing acceleration methods focus on traversing this generative trajectory more efficiently via improved solvers or sampling strategies. We propose a training-free interface that injects prompt-conditioned priors from a lightweight auxiliary model into the diffusion initialization. Because injected priors can be imperfect and unmask-only decoding can over-commit early, we also introduce a simple confidence-based remasking mechanism as a form of prior skepticism.
arXiv Detail & Related papers (2025-12-22T03:45:04Z) - Steering Pretrained Drafters during Speculative Decoding [32.75269650141292]
Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall effectiveness. We introduce a lightweight dynamic alignment mechanism: a steering vector computed from the verifier's hidden states and injected into the pretrained drafter. Our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, all while incurring negligible computational overhead.
arXiv Detail & Related papers (2025-11-13T00:58:32Z) - ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z) - Overcoming Non-monotonicity in Transducer-based Streaming Generation [26.24357071901915]
This research integrates the Transducer's decoding with the history of the input stream via a learnable monotonic attention. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps. Experiments show that our MonoAttn-Transducer effectively handles non-monotonic alignments in streaming scenarios.
arXiv Detail & Related papers (2024-11-26T07:19:26Z) - Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff [49.75167556773752]
Blockwise self-attentional encoder models have emerged as one promising end-to-end approach to simultaneous speech translation.
We propose a modified incremental blockwise beam search incorporating local agreement or hold-$n$ policies for quality-latency control.
arXiv Detail & Related papers (2023-09-20T14:59:06Z) - Look-back Decoding for Open-Ended Text Generation [62.53302138266465]
We propose Look-back, an improved decoding algorithm that tracks the distribution distance between current and historical decoding steps.
Look-back can automatically predict potential repetitive phrase and topic drift, and remove tokens that may cause the failure modes.
We perform decoding experiments on document continuation and story generation, and demonstrate that Look-back is able to generate more fluent and coherent text.
arXiv Detail & Related papers (2023-05-22T20:42:37Z)
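The distance-tracking idea in the Look-back summary above can be sketched as a toy check: flag a decoding step whose next-token distribution is nearly identical to a recent step's, a cheap signal of looming repetition. This is an illustrative sketch only; the KL-divergence measure, the threshold, and the function names are assumptions, not the paper's exact criterion (it assumes distributions over the same support with nonzero entries).

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def looks_repetitive(current_dist, history, threshold=0.1):
    """Return True if the current step's token distribution is within
    `threshold` KL of any recent step's distribution in `history`."""
    return any(kl_divergence(current_dist, past) < threshold for past in history)
```

A decoder using such a check could then down-weight or remove the tokens responsible for the near-duplicate distribution, which is the failure-mode pruning the summary describes.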