Fugu-MT Paper Translation (Summary): Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models

Paper Summary: Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models

arxiv url: http://arxiv.org/abs/2510.04146v1
Date: Sun, 05 Oct 2025 10:50:52 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.490919
Title: Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models
Authors: Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
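Abstract summary: Large Language Models (LLMs) have achieved state-of-the-art performance across a wide range of Natural Language Processing (NLP) tasks. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture.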
Reference score (attention score computed by this site): 82.87985794856803
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks, including document processing and coding. Autoregressive Language Models (ARMs), which generate tokens sequentially conditioned on all previous tokens, have been the predominant paradigm for LLMs. However, while these networks have achieved high accuracy across a range of downstream tasks, they exhibit low arithmetic intensity due to the inherent sequential dependency with next-token prediction. Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture. DLMs generate output text in parallel, breaking the limitations of sequential dependency. However, the performance implications of DLMs relative to commonly deployed ARMs are not fully understood. In this work, we present a comprehensive performance study analyzing the performance characteristics of ARMs and DLMs, using both theoretical analysis and profiling data to characterize the trade-offs between these approaches. We illustrate that although DLMs exhibit higher arithmetic intensity compared to ARMs because of their capability to utilize parallelism across sequence lengths, they fail to scale effectively to longer contexts. We then explore DLMs with block-wise decoding, outlining how this approach allows for increased arithmetic intensity, while still scaling well to long contexts (similar to ARMs). We also show interesting trade-offs for batched inference, where we find that ARMs exhibit superior throughput, as they benefit more from parallelism across sequences in the batch. Finally, we highlight opportunities for accelerating DLM inference, and, in particular, highlight the importance of reducing the number of sampling steps for allowing open-source DLMs to provide improved latency relative to ARMs.
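The abstract's arithmetic-intensity and long-context arguments can be made concrete with a back-of-the-envelope model. The sketch below is a hypothetical illustration, not the paper's analysis. It assumes a single d × d fp16 linear layer whose weights are read once per forward pass, plus rough attention FLOP counts: no KV cache for plain DLM decoding, and a cached prefix for block-wise decoding. The function names and parameters are invented for this example.

```python
"""Toy cost model contrasting ARM, DLM, and block-wise DLM decoding.

Hypothetical simplifications (not taken from the paper): one d x d linear
layer, fp16 (2-byte) values, weights read once per forward pass, and rough
attention FLOP counts (QK^T plus AV, 4 FLOPs per query-key-dim triple).
"""

BYTES_PER_ELEM = 2  # fp16


def linear_arithmetic_intensity(tokens_per_pass: int, d: int) -> float:
    """FLOPs per byte for a d x d matmul applied to `tokens_per_pass` tokens."""
    flops = 2 * tokens_per_pass * d * d                   # 2 FLOPs per multiply-accumulate
    weight_bytes = d * d * BYTES_PER_ELEM                 # weights read once
    act_bytes = 2 * tokens_per_pass * d * BYTES_PER_ELEM  # read input, write output
    return flops / (weight_bytes + act_bytes)


def attention_flops_total(mode: str, seq_len: int, d: int,
                          steps: int = 1, block: int = 1) -> int:
    """Approximate attention FLOPs summed over generating `seq_len` tokens.

    arm:       one new query per token, attending to a KV cache of length t.
    dlm:       `steps` denoising passes, each recomputing full L x L attention.
    dlm_block: `steps` passes per block of `block` tokens; block queries attend
               to the cached prefix plus the current block.
    """
    if mode == "arm":
        return sum(4 * t * d for t in range(1, seq_len + 1))
    if mode == "dlm":
        return steps * 4 * seq_len * seq_len * d
    if mode == "dlm_block":
        if seq_len % block:
            raise ValueError("seq_len must be a multiple of block")
        return sum(steps * 4 * block * end * d
                   for end in range(block, seq_len + 1, block))
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    d, L = 4096, 1024
    # Tokens processed per forward pass for a single sequence.
    for name, m in [("ARM decode", 1), ("block-wise DLM", 32), ("full DLM", L)]:
        print(f"{name:15s} AI = {linear_arithmetic_intensity(m, d):8.2f} FLOP/byte")
    # Both DLM variants use L denoising passes in total (32 blocks x 32 steps).
    print("ARM attention FLOPs:      ", attention_flops_total("arm", L, d))
    print("DLM attention FLOPs:      ", attention_flops_total("dlm", L, d, steps=L))
    print("block DLM attention FLOPs:",
          attention_flops_total("dlm_block", L, d, steps=32, block=32))
```

Under these assumptions, processing many tokens per pass (a DLM denoising step) raises the linear layer's FLOPs per byte far above that of one-token ARM decoding. Recomputing full L × L attention at every step, however, makes total DLM attention cost grow much faster with length than either ARM or block-wise decoding.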
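The abstract's final point is that reducing the number of sampling steps is key to DLM latency. A simple roofline estimate shows why. The sketch below assumes hypothetical hardware and model numbers: an 8B-parameter fp16 model on an accelerator with 312 TFLOP/s peak compute and 2 TB/s memory bandwidth. It models only weight matmuls and ignores attention, KV-cache traffic, and prompt processing. None of these values come from the paper.

```python
"""Toy latency model: when does a DLM beat an ARM on single-sequence latency?

All numbers are hypothetical placeholders, not measurements from the paper.
Only weight matmuls are modeled (2 FLOPs per parameter per token); attention,
KV-cache traffic, and prompt processing are ignored.
"""

PARAMS = 8e9          # hypothetical model size (parameters)
BYTES_PER_PARAM = 2   # fp16 weights
PEAK_FLOPS = 312e12   # hypothetical accelerator peak, FLOP/s
MEM_BW = 2e12         # hypothetical memory bandwidth, bytes/s


def pass_time(tokens_per_pass: int) -> float:
    """Roofline time for one forward pass over `tokens_per_pass` tokens."""
    compute = 2 * PARAMS * tokens_per_pass / PEAK_FLOPS
    memory = PARAMS * BYTES_PER_PARAM / MEM_BW  # weights streamed once per pass
    return max(compute, memory)


def arm_latency(gen_len: int) -> float:
    """One forward pass per generated token, one token per pass."""
    return gen_len * pass_time(1)


def dlm_latency(gen_len: int, steps: int) -> float:
    """`steps` denoising passes, each over the full generated sequence."""
    return steps * pass_time(gen_len)


def breakeven_steps(gen_len: int) -> float:
    """Step count at which DLM latency equals ARM latency."""
    return arm_latency(gen_len) / pass_time(gen_len)


if __name__ == "__main__":
    L = 256
    print(f"ARM latency: {arm_latency(L):.3f} s")
    for steps in (256, 128, 64, 32):
        print(f"DLM latency, {steps:3d} steps: {dlm_latency(L, steps):.3f} s")
    print(f"break-even step count: {breakeven_steps(L):.1f}")
```

With these placeholder numbers, each ARM pass is memory-bound while a DLM pass over 256 tokens is compute-bound. The DLM therefore wins on latency only when it uses fewer than about 156 denoising steps for a 256-token output.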


Related Papers