Fugu-MT Paper Translation (Abstract): OverFill: Two-Stage Models for Efficient Language Model Decoding

Paper abstract: OverFill: Two-Stage Models for Efficient Language Model Decoding

arxiv url: http://arxiv.org/abs/2508.08446v1
Date: Mon, 11 Aug 2025 20:07:34 GMT
Status: Translation complete
System update date: 2025-08-13 21:07:34.223361
Title: OverFill: Two-Stage Models for Efficient Language Model Decoding
Title (reference translation): OverFill: Two-Stage Models for Efficient Language Model Decoding
Authors: Woojeong Kim, Junxiong Wang, Jing Nathan Yan, Mohamed Abdelfattah, Alexander M. Rush
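Abstract summary: Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. We propose OverFill, which decouples the prefill and decode stages to optimize accuracy-efficiency tradeoffs. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, and the 8B-to-3B configuration outperforms 3B pruned models by 79.2%.
Reference score (attention score computed independently by this site): 68.68408155020568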
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) excel across diverse tasks but face significant deployment challenges due to high inference costs. LLM inference comprises prefill (compute-bound) and decode (memory-bound) stages, with decode dominating latency particularly for long sequences. Current decoder-only models handle both stages uniformly, despite their distinct computational profiles. We propose OverFill, which decouples these stages to optimize accuracy-efficiency tradeoffs. OverFill begins with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model, while generating tokens sequentially. Leveraging more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, while the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch, while using significantly less training data. Our code is available at https://github.com/friendshipkim/overfill.
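Abstract (reference translation): Large language models (LLMs) excel across diverse tasks but face significant deployment challenges because of high inference costs. LLM inference consists of a prefill stage (compute-bound) and a decode stage (memory-bound), and decode dominates latency, especially for long sequences. Current decoder-only models treat both stages uniformly despite their different computational profiles. We propose OverFill, which decouples these stages to optimize the accuracy-efficiency tradeoff. OverFill starts with a full model for prefill, processing system and user inputs in parallel. It then switches to a dense pruned model and generates tokens sequentially. By using more compute during prefill, OverFill improves generation quality with minimal latency overhead. Our 3B-to-1B OverFill configuration outperforms 1B pruned models by 83.2%, and the 8B-to-3B configuration improves over 3B pruned models by 79.2% on average across standard benchmarks. OverFill matches the performance of same-sized models trained from scratch while using significantly less training data. Our code is available at https://github.com/friendshipkim/overfill.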
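
The two-stage control flow described in the abstract (one parallel prefill pass by the full model, then sequential decoding by the pruned model) can be illustrated with a toy sketch. This is not the authors' implementation: `ToyModel`, `overfill_generate`, the list-based "cache", and the next-token rule are all hypothetical stand-ins, and the sketch simply assumes the pruned model can read the state produced by the full model, which the abstract does not detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyModel:
    """Hypothetical stand-in for a language model (not the paper's code)."""
    name: str
    calls: int = 0  # number of forward passes this model has made

    def prefill(self, prompt: List[int]) -> List[int]:
        # One forward pass covers the whole prompt (all positions at once).
        self.calls += 1
        # The toy "cache" is just a copy of the processed token ids.
        return list(prompt)

    def decode_step(self, cache: List[int]) -> int:
        # One forward pass yields exactly one new token.
        self.calls += 1
        # Toy next-token rule, assumed purely for illustration.
        return sum(cache) % 100


def overfill_generate(full: ToyModel, pruned: ToyModel,
                      prompt: List[int], max_new_tokens: int) -> List[int]:
    # Stage 1: the full model runs prefill once over the prompt.
    cache = full.prefill(prompt)
    # Stage 2: the pruned model generates tokens one at a time,
    # extending the state that the full model produced.
    generated = []
    for _ in range(max_new_tokens):
        token = pruned.decode_step(cache)
        cache.append(token)
        generated.append(token)
    return generated


full_model = ToyModel("full")
pruned_model = ToyModel("pruned")
tokens = overfill_generate(full_model, pruned_model, [1, 2, 3], 4)
print(tokens, full_model.calls, pruned_model.calls)
```

In this sketch the full model is called once regardless of output length, while the pruned model is called once per generated token, which is why shifting the larger model to prefill adds little to decode latency.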


Related papers