Related papers: ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

URL: http://arxiv.org/abs/2512.13586v1
Date: Mon, 15 Dec 2025 17:41:19 GMT
Title: ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Authors: Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li,
Abstract summary: Autoregressive models (ARMs) are hindered by slow sequential inference.<n>We introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency.<n>ReFusion bridges the performance gap to strong ARMs while maintaining a 2.33$times$ average speedup.
Score: 37.86179431483446
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that achieves superior performance and efficiency by elevating parallel decoding from the token level to a higher slot level, where each slot is a fixed-length, contiguous sub-sequence. This is achieved through an iterative ``plan-and-infill'' decoding process: a diffusion-based planning step first identifies a set of weakly dependent slots, and an autoregressive infilling step then decodes these selected slots in parallel. The slot-based design simultaneously unlocks full KV cache reuse with a unified causal framework and reduces the learning complexity from the token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with 34% performance gains and an over 18$\times$ speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33$\times$ average speedup.

Related papers

Causal Autoregressive Diffusion Language Model [70.7353007255797]
CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass.<n>Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation.
arXiv Detail & Related papers (2026-01-29T17:38:29Z)
WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference [44.87788417755154]
We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention.<n>We show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups.
arXiv Detail & Related papers (2025-12-28T01:25:48Z)
Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing [8.705453442427585]
Large language models (LLMs) have achieved near-human performance across diverse reasoning tasks.<n>Their deployment on resource-constrained Internet-of-Things (IoT) devices remains impractical due to massive parameter footprints and memory-intensive autoregressive decoding.<n>This work introduces the first autoregressive-aware split computing framework designed explicitly for LLM deployment on edge devices.
arXiv Detail & Related papers (2025-11-06T02:55:07Z)
Fast-dLLM v2: Efficient Block-Diffusion LLM [64.38006546510337]
Fast-dLLM v2 is a block diffusion language model that adapts pretrained AR models into dLLMs for parallel text generation.<n>This represents a 500x reduction in training data compared to full-attention diffusion LLMs such as Dream (580B tokens)
arXiv Detail & Related papers (2025-09-30T14:40:18Z)
Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning [23.58934174168992]
Autoregressive (AR) language models generate text one token at a time, which limits their inference speed.<n>We propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation.<n>We also introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context.
arXiv Detail & Related papers (2025-09-18T17:48:21Z)
ToMA: Token Merge with Attention for Diffusion Models [8.079656935981193]
Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity.<n>We propose Token Merge with Attention (ToMA), an off-the-shelf method that negates token reduction for GPU-aligned efficiency.<n>ToMA reduces SDXL/Flux generation latency by 24%/23%, respectively (with DINO $Delta 0.07$), outperforming prior methods.
arXiv Detail & Related papers (2025-09-13T17:35:00Z)
Accelerating Diffusion LLMs via Adaptive Parallel Decoding [60.407727995313074]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel.<n>APD provides markedly higher throughput with minimal quality degradations on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.<n>We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.<n>We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.