Fugu-MT Paper Translation (Summary): Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

Paper summary: Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

arxiv url: http://arxiv.org/abs/2604.14156v1
Date: Sun, 22 Mar 2026 14:27:24 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.680781
Title: Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models
Authors: Andrew Kiruluta
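Abstract summary: Large language models deliver strong generative performance, but at the cost of massive parameter counts, memory use, and decoding latency. We propose a unified compressed-sensing-guided framework for dynamic LLM execution.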
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial compression, while prompt-compression methods reduce latency by removing redundant input tokens. However, these two directions remain largely separate. Most model-compression methods are static and optimized offline, and they do not exploit the fact that different prompts and decoding steps activate different latent computational pathways. Prompt-compression methods reduce sequence length, but they do not adapt the executed model subnetwork. We propose a unified compressed-sensing-guided framework for dynamic LLM execution. Random measurement operators probe latent model usage, sparse recovery estimates task-conditioned and token-adaptive support sets, and the recovered supports are compiled into hardware-efficient sparse execution paths over blocks, attention heads, channels, and feed-forward substructures. The framework introduces five key contributions: task-conditioned measurements, so different prompts induce different sparse supports; token-adaptive recovery, so active substructures are re-estimated during decoding; formal sample-complexity bounds under restricted isometry or mutual incoherence assumptions; compile-to-hardware constraints that restrict recovery to GPU-efficient structures; and a joint objective that unifies prompt compression with model reduction. Together, these components recast LLM inference as a measurement-and-recovery problem with explicit approximation guarantees and deployment-oriented speedup constraints.
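
The measurement-and-recovery idea in the abstract can be illustrated with a textbook compressed-sensing routine. The sketch below is not the paper's algorithm: it assumes a hypothetical vector `x` of per-unit "usage" scores (for example, one entry per attention head or channel) that is k-sparse, a Gaussian random measurement matrix `A`, and orthogonal matching pursuit (OMP) as the sparse-recovery step. The names `omp` and `demo`, and the sizes chosen, are illustrative assumptions only.

```python
import numpy as np


def omp(A, y, k):
    """Orthogonal matching pursuit: greedily estimate a k-sparse x from y = A @ x.

    Returns the estimate x_hat and the sorted list of selected indices.
    """
    n = A.shape[1]
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Score every column by its correlation with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # do not reselect a chosen column
        support.append(int(np.argmax(correlations)))
        # Refit by least squares on the columns selected so far.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat, sorted(support)


def demo(n=64, m=48, k=4, seed=0):
    """Hypothetical setup: n structural units, m random measurements, k active units."""
    rng = np.random.default_rng(seed)
    true_support = sorted(rng.choice(n, size=k, replace=False).tolist())
    x = np.zeros(n)
    x[true_support] = rng.choice([-1.0, 1.0], size=k) * (1.0 + rng.random(k))
    A = rng.standard_normal((m, n)) / np.sqrt(m)  # random measurement operator
    y = A @ x                                     # m < n compressed measurements
    _, est_support = omp(A, y, k)
    return true_support, est_support


if __name__ == "__main__":
    true_support, est_support = demo()
    print("true support:     ", true_support)
    print("recovered support:", est_support)
```

In this toy setting the recovered index set plays the role of the "support set" in the abstract: the units one would keep active. How the paper constructs its measurements, conditions them on the task, or restricts supports to GPU-efficient structures is not specified here and is not modeled by this sketch.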

Related papers