Related papers: Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

URL: http://arxiv.org/abs/2512.14681v1
Date: Tue, 16 Dec 2025 18:45:18 GMT
Title: Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang,
Abstract summary: Jacobi Forcing is a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories.<n>We introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup.
Score: 41.89066334075016
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Model, achieves 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.

Related papers

d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation [31.922313594074925]
Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs.<n>Current methods typically focus on only one-side of the coin, targeting either efficiency or performance.<n>We propose d3LLM (Pseudo-Distilled Diffusion Large Language Model), striking a balance between accuracy and parallelism.
arXiv Detail & Related papers (2026-01-12T14:25:36Z)
DiRL: An Efficient Post-Training Framework for Diffusion Language Models [54.405206032785706]
Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models.<n>Existing methods suffer from computational inefficiency and objective mismatches between training and inference.<n>We introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference.
arXiv Detail & Related papers (2025-12-23T08:33:19Z)
dParallel: Learnable Parallel Decoding for dLLMs [77.24184219948337]
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency.<n>Existing open-source models still require nearly token-length decoding steps to ensure performance.<n>We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z)
Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation.<n>This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens)
arXiv Detail & Related papers (2025-09-30T14:40:18Z)
CoVAE: Consistency Training of Variational Autoencoders [9.358185536754537]
We propose a novel single-stage generative autoencoding framework that adopts techniques from consistency models to train a VAE architecture.<n>We show that CoVAE can generate high-quality samples in one or few steps without the use of a learned prior.<n>Our approach provides a unified framework for autoencoding and diffusion-style generative modeling and provides a viable route for one-step generative high-performance autoencoding.
arXiv Detail & Related papers (2025-07-12T01:32:08Z)
DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation [68.19756761027351]
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models.<n>We investigate their denoising processes and reinforcement learning methods.<n>Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
arXiv Detail & Related papers (2025-06-25T17:35:47Z)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel.<n>APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
Training and Inference Efficiency of Encoder-Decoder Speech Models [25.031622057759492]
We focus on the efficiency angle and ask the questions of whether we are training these speech models efficiently.<n>We show that negligence in mini-batch sampling leads to more than 50% being spent on padding.<n>We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup.
arXiv Detail & Related papers (2025-03-07T20:57:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.