Fugu-MT Paper Translation (Abstract): Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

Paper summary: Fast and Accurate Causal Parallel Decoding using Jacobi Forcing

arxiv url: http://arxiv.org/abs/2512.14681v1
Date: Tue, 16 Dec 2025 18:45:18 GMT
Status: Translation complete
Last updated in system: 2025-12-17 16:49:26.838424
Title: Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Title (reference translation): Fast and Accurate Causal Parallel Decoding Using Jacobi Forcing
Authors: Lanxiang Hu, Siqi Kou, Yichao Fu, Samyam Rajbhandari, Tajana Rosing, Yuxiong He, Zhijie Deng, Hao Zhang
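Abstract summary: Jacobi Forcing is a progressive distillation paradigm in which models are trained on their own generated parallel decoding trajectories. The authors also introduce multi-block decoding with rejection recycling, which yields up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup.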
Reference score (attention score computed by this site): 41.89066334075016
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-token generation has emerged as a promising paradigm for accelerating transformer-based large model inference. Recent efforts primarily explore diffusion Large Language Models (dLLMs) for parallel decoding to reduce inference latency. To achieve AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, they suffer from limited speedup compared to AR models due to a pretrain-to-posttrain mismatch. Specifically, the masked data distribution in post-training deviates significantly from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders the integration of exact KV cache reuse. To address this, we introduce Jacobi Forcing, a progressive distillation paradigm where models are trained on their own generated parallel decoding trajectories, smoothly shifting AR models into efficient parallel decoders while preserving their pretrained causal inference property. The models trained under this paradigm, Jacobi Forcing Models, achieve 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on Jacobi Forcing Models' trajectory characteristics, we introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. Our code is available at https://github.com/hao-ai-lab/JacobiForcing.
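Abstract (reference translation): Multi-token generation has emerged as a promising paradigm for accelerating inference in transformer-based large models. Recent work mainly explores diffusion large language models (dLLMs) for parallel decoding in order to reduce inference latency. To reach AR-level generation quality, many techniques adapt AR models into dLLMs to enable parallel decoding. However, a mismatch between pretraining and post-training limits their speedup relative to AR models. Specifically, the masked data distribution used in post-training differs substantially from the real-world data distribution seen during pretraining, and dLLMs rely on bidirectional attention, which conflicts with the causal prior learned during pretraining and hinders exact KV cache reuse. To address this, the authors introduce Jacobi Forcing, a progressive distillation paradigm in which models are trained on their own generated parallel decoding trajectories, smoothly converting AR models into efficient parallel decoders while preserving their pretrained causal inference property. Models trained under this paradigm, Jacobi Forcing Models, achieve a 3.8x wall-clock speedup on coding and math benchmarks with minimal loss in performance. Based on the trajectory characteristics of Jacobi Forcing Models, the authors introduce multi-block decoding with rejection recycling, which enables up to 4.5x higher token acceptance count per iteration and nearly 4.0x wall-clock speedup, effectively trading additional compute for lower inference latency. The code is available at https://github.com/hao-ai-lab/JacobiForcing.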
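The abstract refers to "parallel decoding trajectories" without spelling out the underlying iteration. As a rough illustration only, the sketch below shows plain Jacobi (fixed-point) decoding on a toy deterministic next-token function. It is not the Jacobi Forcing training procedure, nor the paper's multi-block decoding with rejection recycling. The names `toy_next_token`, `greedy_ar_decode`, and `jacobi_decode` are hypothetical and do not come from the authors' repository.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def toy_next_token(prefix: List[int]) -> int:
    """Hypothetical stand-in for a causal model's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def greedy_ar_decode(next_token: NextToken, prompt: List[int], n: int) -> List[int]:
    """Ordinary autoregressive decoding: one token per step."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(
    next_token: NextToken, prompt: List[int], n: int, init_token: int = 0
) -> Tuple[List[int], List[List[int]]]:
    """Jacobi decoding of an n-token block, returning (block, trajectory).

    Each iteration refreshes every position from the previous iterate.
    With a real model this is a single causal forward pass over the block;
    here it is emulated with a list comprehension.
    """
    block = [init_token] * n          # arbitrary initial guess
    trajectory = [list(block)]        # sequence of iterates
    for _ in range(n):                # at most n iterations are needed
        new_block = [next_token(list(prompt) + block[:i]) for i in range(n)]
        trajectory.append(list(new_block))
        if new_block == block:        # fixed point reached
            break
        block = new_block
    return block, trajectory


if __name__ == "__main__":
    prompt = [1, 2, 3]
    block, trajectory = jacobi_decode(toy_next_token, prompt, 8)
    print("iterations:", len(trajectory) - 1)
    print("matches greedy AR:", block == greedy_ar_decode(toy_next_token, prompt, 8))
```

After iteration k, at least the first k tokens of the block agree with greedy autoregressive decoding, so the fixed point equals the greedy output after at most n iterations. Any speedup comes from positions that become correct earlier than this worst case, which is the behavior the abstract says training on the model's own trajectories is meant to encourage.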

Related papers