Fugu-MT Paper Translation (Abstract): K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

Paper overview: K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

arxiv url: http://arxiv.org/abs/2606.10820v2
Date: Wed, 10 Jun 2026 06:43:22 GMT
Status: Translation complete
Last updated in system: 2026-06-11 14:23:44.399016
Title: K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling
Title (reference translation): K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling
Authors: Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao, Jiasheng Tang, Fan Wang, Bohan Zhuang
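Abstract summary: K-Forcing is a push-forward language modeling paradigm for next-k-token decoding. K-Forcing is evaluated on LM1B and OpenWebText using a standard causal Transformer backbone.
Reference score (independently computed attention score): 37.46942663162738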
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving--the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping--one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.
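Abstract (reference translation): Autoregressive (AR) language modeling is the dominant paradigm for text generation, but its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration methods such as speculative decoding and diffusion language models yield speedups under certain conditions but do not directly address high-load batch serving. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping with progressive self-forcing distillation, which gradually expands the prediction window while letting the student closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across batch sizes while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.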
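The abstract describes a push-forward mapping that turns independent uniform noise variables into a joint sample of several future tokens. The sketch below illustrates that idea on a toy scale only. It is not the paper's student model or training procedure, which the abstract does not specify. The bigram table `TEACHER` and the helpers `inverse_cdf`, `push_forward`, and `passes_needed` are hypothetical names introduced here. The sketch chains inverse-CDF sampling through a toy teacher, so that the k tokens become a deterministic function of the context and k uniform draws. A distilled student would aim to reproduce such a mapping in a single forward pass rather than k sequential ones.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate

# Hypothetical toy "teacher": a bigram model over a 3-token vocabulary.
# TEACHER[prev][tok] is P(tok | prev); the numbers are made up for illustration.
TEACHER = [
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.25, 0.25, 0.5],
]


def inverse_cdf(probs, u):
    """Map a uniform draw u in [0, 1) to a categorical sample from probs."""
    cdf = list(accumulate(probs))
    # bisect_right returns the first index whose cumulative mass exceeds u;
    # the min() guards against floating-point round-off in the last entry.
    return min(bisect_right(cdf, u), len(probs) - 1)


def push_forward(prev_token, noise):
    """Deterministically map (context, k uniform draws) to k tokens.

    This reference version calls the teacher k times in sequence. It only
    defines the target mapping; it is not a single-pass implementation.
    """
    tokens = []
    for u in noise:
        prev_token = inverse_cdf(TEACHER[prev_token], u)
        tokens.append(prev_token)
    return tokens


def passes_needed(num_tokens, k):
    """Forward passes needed to emit num_tokens when each pass yields k tokens."""
    return math.ceil(num_tokens / k)


# Usage: draw k = 4 independent uniforms and push them through the mapping.
rng = random.Random(0)
noise = [rng.random() for _ in range(4)]
tokens = push_forward(0, noise)
```

Because the noise fully determines the output, feeding the same context and noise always yields the same k tokens, while fresh noise yields a fresh joint sample. Under this counting, k = 4 cuts the number of forward passes by 4x. The reported 2.4-3.5x speedup is lower, which would be consistent with each joint pass costing somewhat more than a single-token pass, although the abstract does not break this down.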
