Fugu-MT Paper Translation (Abstract): ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

Paper overview: ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

arxiv url: http://arxiv.org/abs/2508.08895v1
Date: Tue, 12 Aug 2025 12:35:55 GMT
Status: translation complete
Last updated in system: 2025-08-13 21:07:34.422956
Title: ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs
Title (reference translation): ASPD: Unlocking Adaptive Serial-Parallel Decoding through Exploration of Intrinsic Parallelism in LLMs
Authors: Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun
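Abstract summary: Large language models (LLMs) face significant inference latency problems because of their autoregressive decoding paradigm. This paper proposes Adaptive Serial-Parallel Decoding (ASPD), which addresses two challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. The framework sets a foundational benchmark for efficient LLM parallel inference and paves the way for deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.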
Reference score (independently computed attention score): 34.477777651648914
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm, which is characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we observed that some segments exhibit parallelizable structures, which we term intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can significantly improve the overall inference speed of LLMs. In this paper, we propose Adaptive Serial-Parallel Decoding (ASPD), which addresses two core challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. More specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To empower efficient adaptive serial-parallel decoding, we implement a Hybrid Decoding Engine that enables seamless transitions between serial and parallel decoding modes while maintaining a reusable KV cache, maximizing computational efficiency. Extensive evaluations across General Tasks, Retrieval-Augmented Generation, and Mathematical Reasoning demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. Notably, on Vicuna Bench, our method achieves up to 3.19x speedup (1.85x on average) while maintaining response quality within a 1% difference of autoregressive models, realizing significant acceleration without compromising generation quality. Our framework sets a groundbreaking benchmark for efficient LLM parallel inference, paving the way for its deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.
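Abstract (reference translation): The growing scale and complexity of large language models (LLMs) cause serious inference latency problems, mainly because of the autoregressive decoding paradigm, which is characterized by the sequential nature of next-token prediction. By re-examining the outputs of autoregressive models, we find that some segments show parallelizable structures, which we call intrinsic parallelism. Decoding each parallelizable branch simultaneously (i.e., parallel decoding) can greatly improve the overall inference speed of LLMs. This paper proposes Adaptive Serial-Parallel Decoding (ASPD), which addresses two challenges: automated construction of parallelizable data and an efficient parallel decoding mechanism. Specifically, we introduce a non-invasive pipeline that automatically extracts and validates parallelizable structures from the responses of autoregressive models. To make adaptive serial-parallel decoding efficient, we implement a Hybrid Decoding Engine that allows seamless transitions between serial and parallel decoding modes while keeping a reusable KV cache, maximizing computational efficiency. Broad evaluations on general tasks, retrieval-augmented generation, and mathematical reasoning demonstrate that ASPD achieves unprecedented performance in both effectiveness and efficiency. In particular, on Vicuna Bench the method achieves up to 3.19x speedup (1.85x on average) while keeping response quality within a 1% difference of autoregressive models, delivering significant acceleration without compromising generation quality. Our framework sets a foundational benchmark for efficient LLM parallel inference and paves the way for deployment in latency-sensitive applications such as AI-powered customer service bots and answer retrieval engines.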
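
The idea that decoding parallelizable branches simultaneously shortens generation can be illustrated with a toy step count. The sketch below is not the paper's Hybrid Decoding Engine: the `Serial` and `Parallel` classes, the function names, and the token counts are all hypothetical, and it assumes an idealized setting where one decoding step costs the same whether it advances one branch or several.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Serial:
    tokens: int  # tokens that must be decoded one after another


@dataclass
class Parallel:
    branches: List[int]  # token count of each independent branch


Segment = Union[Serial, Parallel]


def autoregressive_steps(plan: List[Segment]) -> int:
    """Steps when every token is decoded sequentially."""
    total = 0
    for seg in plan:
        if isinstance(seg, Serial):
            total += seg.tokens
        else:
            total += sum(seg.branches)
    return total


def serial_parallel_steps(plan: List[Segment]) -> int:
    """Steps when the branches of a parallel segment advance together.

    Assumption: a parallel segment finishes when its longest branch does.
    """
    total = 0
    for seg in plan:
        if isinstance(seg, Serial):
            total += seg.tokens
        else:
            total += max(seg.branches, default=0)
    return total


def ideal_speedup(plan: List[Segment]) -> float:
    """Ratio of step counts; ignores batching and mode-switch overhead."""
    fast = serial_parallel_steps(plan)
    if fast == 0:
        raise ValueError("plan contains no tokens")
    return autoregressive_steps(plan) / fast


# Hypothetical response: an intro, three independent points, a conclusion.
plan = [Serial(20), Parallel([40, 35, 45]), Serial(15)]
print(autoregressive_steps(plan), serial_parallel_steps(plan), ideal_speedup(plan))
```

In this toy model the gain depends on how much of the response sits in parallel segments and how balanced the branches are. Measured speedups on real hardware will also reflect overheads that the step count ignores.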

Related papers