Fugu-MT paper translation (abstract): From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

Paper overview: From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

arxiv url: http://arxiv.org/abs/2512.06776v1
Date: Sun, 07 Dec 2025 10:28:21 GMT
Status: translation complete
Last updated in system: 2025-12-09 22:03:54.520058
Title: From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
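Title (reference translation): From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs
Authors: Yuchuan Tian, Yuchen Liang, Jiacheng Sun, Shuo Zhang, Guangwen Yang, Yingte Shu, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, Yunhe Wang
Abstract summary: The paper shows that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance among 7B-class DLMs.
Reference score (independently computed attention metric): 58.640039233470766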
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) excel at generation, but dominant autoregressive (AR) decoding is inherently sequential, creating a throughput bottleneck. Diffusion Language Models (DLMs), especially block-wise variants, enable parallel generation and intra-block bidirectional reasoning, yet training large DLMs from scratch is costly and wastes the knowledge in mature AR checkpoints. Prior "adaptation" attempts either modify logits or randomly grow attention masks to full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, leaving a fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. We reframe adaptation as an intra-paradigm path from AR to Block-Diffusion by viewing AR as Block-Diffusion with blocksize=1. Concretely, we design the adaptation pathway as follows: we use a context-causal attention mask (causal in context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss to maximize data utilization and retain pretrained knowledge, and a gradual increase of the generation block size. The recipe integrates cleanly with masked block-diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) inherits the long-context modeling and reasoning capabilities of its AR checkpoint and achieves state-of-the-art performance among 7B-class DLMs, delivering strong gains on general-knowledge, math, and code benchmarks over strong baselines. These results demonstrate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Code: https://github.com/YuchuanTian/NBDiff.
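Abstract (reference translation): Large language models (LLMs) are strong at generation, but the dominant autoregressive (AR) decoding is inherently sequential and creates a throughput bottleneck. Diffusion Language Models (DLMs), especially block-wise variants, allow parallel generation and bidirectional reasoning within a block, but training a large DLM from scratch is costly and wastes the knowledge stored in mature AR checkpoints. Earlier "adaptation" attempts either modify logits, randomly grow attention masks toward full-sequence diffusion, or simply transplant AR weights into a block-diffusion recipe, and so leave the fundamental mismatch between AR causality and block-wise bidirectionality unaddressed. The authors reframe adaptation as an intra-paradigm path from AR to block diffusion by viewing AR as block diffusion with blocksize=1. Specifically, the adaptation pathway consists of a context-causal attention mask (causal in the context, bidirectional only within the active block), an efficient parallel adaptation procedure, an auxiliary AR loss that maximizes data utilization and retains pretrained knowledge, and a gradual increase of the generation block size. The recipe integrates cleanly with masked block diffusion and maintains train-inference consistency. Built on these components, NBDiff-7B (Base and Instruct) inherits long-context modeling and reasoning capabilities and achieves state-of-the-art performance among 7B-class DLMs, with strong gains over strong baselines on general-knowledge, math, and code benchmarks. These results indicate that principled AR-to-block-diffusion adaptation is an effective and compute-efficient alternative to training DLMs from scratch. Code: https://github.com/YuchuanTian/NBDiff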
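
The context-causal attention mask mentioned in the abstract (causal in the context, bidirectional only within the active block) can be illustrated with a small sketch. This is one reading of that single sentence, not the authors' implementation: the function names `context_causal_mask` and `causal_mask`, the parameters `block_start` and `block_size`, and the choice to represent the mask as a nested list of booleans are all assumptions made here for illustration.

```python
def causal_mask(seq_len):
    """Standard AR mask: position i may attend to position j only if j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def context_causal_mask(seq_len, block_start, block_size):
    """Hypothetical context-causal mask (an interpretation of the abstract).

    mask[i][j] is True when query position i may attend to key position j.
    Positions outside the active block keep ordinary causal attention;
    positions inside the active block [block_start, block_start + block_size)
    additionally attend to every other position in that block.
    """
    if not 0 <= block_start < seq_len or block_size < 1:
        raise ValueError("active block must start inside the sequence")
    block_end = min(block_start + block_size, seq_len)

    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            both_in_block = (block_start <= i < block_end
                             and block_start <= j < block_end)
            row.append(causal or both_in_block)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Three context tokens followed by an active block of three tokens.
    for row in context_causal_mask(seq_len=6, block_start=3, block_size=3):
        print("".join("1" if allowed else "." for allowed in row))
```

With `block_size=1` the second condition only ever matches `i == j`, so the mask equals the ordinary causal mask. That is the sense in which the abstract treats AR as block diffusion with blocksize=1, and it suggests why the block size can be increased gradually from a pretrained AR checkpoint.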

Related papers