Fugu-MT Paper Translation (Abstract): StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving

Paper abstract: StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving

arxiv url: http://arxiv.org/abs/2603.28795v1
Date: Tue, 24 Mar 2026 17:19:26 GMT
Status: Translation complete
Last updated in system: 2026-04-01 15:25:02.476377
Title: StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
Title (reference translation): StepCache: Step-Level Reuse with Lightweight Verification and Selective Patching for LLM Serving
Authors: Azam Nouri
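Abstract summary: StepCache is a backend-agnostic step-level reuse layer that segments outputs into ordered steps. It regenerates only failing regions via selective patching. Mean latency falls from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s.
Reference score (independently computed attention score): 0.0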
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We address LLM serving workloads where repeated requests share a common solution structure but differ in localized constraints, such as output schema, variable names, or numeric constants. Prior caching approaches typically reuse either full responses (semantic caching) or model-internal KV/prefix states, which are respectively brittle under partial changes or tightly coupled to specific backends. We present StepCache, a backend-agnostic step-level reuse layer that segments outputs into ordered steps, retrieves the best-matching cached request, verifies steps using lightweight task-aware checks, and regenerates only failing regions via selective patching. StepCache additionally supports strict structured-output enforcement for JSON, including single-step extraction, required-key constraints, and one-shot repair, as well as conservative skip-reuse fallbacks for semantic changes. For linear equations, StepCache promotes verification into correction via a bounded repair loop with a deterministic fallback that guarantees correctness when the backend model fails. In a CPU-only perturbation-heavy micro-benchmark on math and JSON variants, averaged over three seeds, StepCache reduces mean latency from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s. It also reduces total token usage from 36.1k to 27.3k and improves end-to-end correctness from 72.5% to 100% under task-specific checks and a stitched-output integrity check. Across requests, 79.7% take the reuse-only fast path, 5.4% require patching, and 14.9% trigger skip-reuse.
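The reuse path described in the abstract (segment into ordered steps, verify each cached step, regenerate only the failing ones) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `split_steps` and `reuse_with_patching`, the one-step-per-line segmentation, and the toy `verify` and `regenerate` callbacks are all assumptions made for illustration, and retrieval of the best-matching cached request is omitted.

```python
from typing import Callable, List, Tuple


def split_steps(output: str) -> List[str]:
    """Segment an output into ordered steps.

    Assumption: one step per non-empty line.
    """
    return [line.strip() for line in output.splitlines() if line.strip()]


def reuse_with_patching(
    cached_output: str,
    verify: Callable[[int, str], bool],
    regenerate: Callable[[int, str], str],
) -> Tuple[str, List[int]]:
    """Keep cached steps that pass `verify`; regenerate only those that fail.

    Returns the stitched output and the indices of the patched steps.
    """
    steps = split_steps(cached_output)
    patched: List[int] = []
    for i, step in enumerate(steps):
        if not verify(i, step):
            steps[i] = regenerate(i, step)  # selective patch of one step
            patched.append(i)
    return "\n".join(steps), patched


# Hypothetical cached answer for "2x + 3 = 7"; the new request is "2x + 3 = 11".
cached = "Subtract 3 from both sides\nDivide both sides by 2\nx = 2"


def verify(i: int, step: str) -> bool:
    # Toy task-aware check: only the final-answer step depends on the constant.
    return not step.startswith("x =") or step == "x = 4"


def regenerate(i: int, step: str) -> str:
    # Stand-in for a call to the backend model.
    return "x = 4"


result, patched = reuse_with_patching(cached, verify, regenerate)
```

In this toy case only the last step is regenerated; the first two steps are reused unchanged, which corresponds to the patching path rather than the reuse-only fast path.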
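Abstract (reference translation): We address LLM serving workloads in which repeated requests share a common solution structure but differ in localized constraints such as output schema, variable names, or numeric constants. Prior caching approaches generally reuse either full responses (semantic caching) or model-internal KV/prefix states. StepCache is a backend-agnostic step-level reuse layer that segments outputs into ordered steps, retrieves the best-matching cached request, and verifies steps using lightweight task-aware checks. StepCache also supports strict structured-output enforcement for JSON, including single-step extraction, required-key constraints, and one-shot repair, along with conservative skip-reuse fallbacks for semantic changes. For linear equations, StepCache promotes verification into correction through a bounded repair loop, with a deterministic fallback that guarantees correctness when the backend model fails. In a CPU-only, perturbation-heavy micro-benchmark on math and JSON variants, averaged over three seeds, StepCache reduces mean latency from 2.13 s to 0.67 s, median latency from 2.42 s to 0.01 s, and p95 latency from 3.38 s to 3.30 s. It also reduces total token usage from 36.1k to 27.3k and improves end-to-end correctness from 72.5% to 100% under task-specific checks and a stitched-output integrity check. Across requests, 79.7% take the reuse-only fast path, 5.4% require patching, and 14.9% trigger skip-reuse.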
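For linear equations, the abstract mentions a bounded repair loop with a deterministic fallback. One way such a loop could look is sketched below, assuming equations of the form a*x + b = c with integer coefficients. The names `solve_linear`, `check`, and `answer_with_repair`, the `propose` callback standing in for the backend model, and the default attempt limit are hypothetical and not taken from the paper.

```python
from fractions import Fraction
from typing import Callable, Optional


def solve_linear(a: int, b: int, c: int) -> Fraction:
    """Deterministic fallback: exact solution of a*x + b = c (a must be non-zero)."""
    if a == 0:
        raise ValueError("a must be non-zero")
    return Fraction(c - b, a)


def check(a: int, b: int, c: int, x: Optional[Fraction]) -> bool:
    """Lightweight verification: substitute x back into the equation."""
    return x is not None and a * x + b == c


def answer_with_repair(
    a: int,
    b: int,
    c: int,
    propose: Callable[[int], Optional[Fraction]],
    max_attempts: int = 2,
) -> Fraction:
    """Try up to `max_attempts` proposals, then fall back to the exact solver."""
    for attempt in range(max_attempts):
        x = propose(attempt)  # stand-in for a backend model call
        if check(a, b, c, x):
            return x
    return solve_linear(a, b, c)


# A proposer that is always wrong for 2x + 3 = 11 forces the fallback.
always_wrong = answer_with_repair(2, 3, 11, lambda attempt: Fraction(1))

# A proposer that is right on its second attempt is accepted by the loop.
second_try = answer_with_repair(
    2, 3, 11, lambda attempt: Fraction(4) if attempt == 1 else None
)
```

Because the fallback is computed exactly with `Fraction`, the returned value always satisfies the check, regardless of what the proposer returns.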


Related papers