Fugu-MT Paper Translation (Summary): Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

Paper summary: Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

arxiv url: http://arxiv.org/abs/2601.22132v1
Date: Thu, 29 Jan 2026 18:52:54 GMT
Status: Translation complete
Last updated in system: 2026-01-30 16:22:50.09657
Title: Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference
Title (reference translation): Paying for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference
Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu
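Abstract summary: Small Language Models (SLMs) offer dramatic cost savings but lag substantially in accuracy. LLM Shepherding is a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making.
Reference score (independently computed attention measure): 7.865726406769634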
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
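Abstract (reference translation): Large language models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference cost limits deployment at scale. Small language models (SLMs) offer dramatic cost savings but lag substantially in accuracy. Existing approaches, routing and cascading, treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. LLM Shepherding is a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. Even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.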
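
The hint mechanism described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`shepherd`, `toy_llm`, `toy_slm`, `toy_predictor`), the call signatures, and the rule used by the toy predictor are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow: a predictor picks a token budget, the LLM produces at most that many tokens as a prefix, and the SLM completes the answer from that prefix.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   llm(query, max_tokens) -> list of at most max_tokens tokens (the hint)
#   slm(query, hint)       -> final answer text
#   predict_hint_tokens(query) -> token budget; 0 means "no hint needed"
LLM = Callable[[str, int], List[str]]
SLM = Callable[[str, str], str]
Predictor = Callable[[str], int]


def shepherd(query: str, slm: SLM, llm: LLM,
             predict_hint_tokens: Predictor) -> Tuple[str, int]:
    """Answer `query` with the SLM, optionally seeded by a short LLM prefix.

    Returns the answer and the number of LLM tokens actually requested.
    """
    budget = max(0, predict_hint_tokens(query))
    # A budget of 0 bypasses the LLM entirely (the SLM answers alone).
    hint_tokens = llm(query, budget) if budget > 0 else []
    answer = slm(query, " ".join(hint_tokens))
    return answer, len(hint_tokens)


# Toy stand-ins so the sketch runs without any model.
def toy_llm(query: str, max_tokens: int) -> List[str]:
    full = "First add 2 and 3 to get 5 then double it to get 10".split()
    return full[:max_tokens]


def toy_slm(query: str, hint: str) -> str:
    return f"[hint: {hint}] answer" if hint else "answer"


def toy_predictor(query: str) -> int:
    # Illustrative rule only; the paper describes a learned two-stage predictor.
    return 4 if "hard" in query else 0


if __name__ == "__main__":
    print(shepherd("easy question", toy_slm, toy_llm, toy_predictor))
    print(shepherd("hard question", toy_slm, toy_llm, toy_predictor))
```

In this sketch, a budget of zero corresponds to the case where the query bypasses the LLM, and a budget as long as the full LLM response corresponds to the case where the LLM generates a complete answer; intermediate budgets are the hints the abstract refers to.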

Related papers