Fugu-MT Paper Translation (Abstract): HAMburger: Accelerating LLM Inference via Token Smashing

Paper abstract: HAMburger: Accelerating LLM Inference via Token Smashing

arxiv url: http://arxiv.org/abs/2505.20438v1
Date: Mon, 26 May 2025 18:34:07 GMT
Status: Translation complete
Last updated in system: 2025-05-28 17:05:58.247367
Title: HAMburger: Accelerating LLM Inference via Token Smashing
Title (reference translation): HAMburger: Accelerating LLM Inference via Token Smashing
Authors: Jingyu Liu, Ce Zhang
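Abstract summary: HAMburger is a hierarchically auto-regressive model that redefines resource allocation in large language model inference. HAMburger reduces KV cache computation by up to 2$\times$ and achieves up to 2$\times$ TPS. The method explores an extremely challenging inference regime that requires both computation and memory efficiency with a hardware-agnostic design.
Reference score (independently computed attention measure): 11.730038057167622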
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The growing demand for efficient Large Language Model (LLM) inference requires a holistic optimization on algorithms, systems, and hardware. However, very few works have fundamentally changed the generation pattern: each token needs one forward pass and one KV cache. This can be sub-optimal because we found that LLMs are extremely capable of self-identifying the exact dose of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, we introduce HAMburger, a Hierarchically Auto-regressive Model that redefines resource allocation in LLMs by moving beyond uniform computation and storage per token during inference. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. Additionally, HAMburger functions as a speculative decoding framework where it can blindly trust self-drafted tokens. As a result, HAMburger shifts the growth of KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. Extensive evaluations show that HAMburger reduces the KV cache computation by up to 2$\times$ and achieves up to 2$\times$ TPS, while maintaining quality in both short- and long-context tasks. Our method explores an extremely challenging inference regime that requires both computation- and memory-efficiency with a hardware-agnostic design.
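Abstract (reference translation): Growing demand for efficient Large Language Model (LLM) inference requires holistic optimization across algorithms, systems, and hardware. Each token requires one forward pass and one KV cache. This is sub-optimal, because LLMs are extremely capable of self-identifying the exact amount of information that a single KV cache can store, and many tokens can be generated confidently without global context. Based on this insight, the paper introduces HAMburger, a hierarchically auto-regressive model that redefines resource allocation in LLMs. Stacking a compositional embedder and a micro-step decoder in between a base LLM, HAMburger smashes multiple tokens into a single KV and generates several tokens per step. In addition, HAMburger functions as a speculative decoding framework that can blindly trust its self-drafted tokens. As a result, HAMburger shifts the growth of the KV cache and forward FLOPs from linear to sub-linear with respect to output length, and adjusts its inference speed based on query perplexity and output structure. In extensive evaluations, HAMburger reduces KV cache computation by up to 2$\times$ and achieves up to 2$\times$ TPS while maintaining quality on both short- and long-context tasks. The method explores an extremely challenging inference regime that requires both computation and memory efficiency with a hardware-agnostic design.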
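
The abstract's accounting claim (one forward pass and one KV cache per token in standard decoding, versus several tokens sharing a single KV) can be illustrated with a toy counter. This is only a sketch under stated assumptions, not the paper's implementation: the function name `count_costs` and the per-step chunk sizes are hypothetical, and the cost of the micro-step decoder and of prompt tokens is ignored.

```python
from typing import List, Tuple


def count_costs(chunk_sizes: List[int]) -> Tuple[int, int, int]:
    """Return (tokens, baseline_kv_entries, smashed_kv_entries).

    chunk_sizes[i] is a hypothetical number of tokens emitted at step i.
    Baseline decoding: one forward pass and one KV entry per token.
    Toy "smashed" decoding: one base-LLM forward pass and one KV entry
    per step, regardless of how many tokens that step emits.
    """
    if any(size < 1 for size in chunk_sizes):
        raise ValueError("every step must emit at least one token")
    tokens = sum(chunk_sizes)
    baseline_kv = tokens           # grows linearly with output length
    smashed_kv = len(chunk_sizes)  # grows with the number of steps only
    return tokens, baseline_kv, smashed_kv


# Hypothetical chunk sizes, chosen only for illustration.
tokens, baseline_kv, smashed_kv = count_costs([1, 2, 2, 3, 2])
print(tokens, baseline_kv, smashed_kv)                   # 10 10 5
print(f"KV reduction: {baseline_kv / smashed_kv:.1f}x")  # KV reduction: 2.0x
```

In this toy model the reduction factor equals the average number of tokens per step, so steps that emit more tokens (for example on predictable, structured output) would yield a larger saving.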
