Fugu-MT Paper Translation (Abstract): Optimal Stopping vs Best-of-$N$ for Inference Time Optimization

Paper summary: Optimal Stopping vs Best-of-$N$ for Inference Time Optimization

arxiv url: http://arxiv.org/abs/2510.01394v1
Date: Wed, 01 Oct 2025 19:25:59 GMT
Status: Translation complete
System update date: 2025-10-03 16:59:20.841832
Title: Optimal Stopping vs Best-of-$N$ for Inference Time Optimization
Title (reference translation): Optimal Stopping and Best-of-$N$ for Inference-Time Optimization
Authors: Yusuf Kalayci, Vinod Raman, Shaddin Dughmi
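Abstract summary: We propose a new framework for inference-time optimization based on the Pandora's Box problem. We develop algorithms that decide when to stop generating without knowing the reward distribution. The results establish a principled bridge between optimal stopping theory and inference-time scaling.
Reference score (independently computed attention measure): 11.334978981105559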
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) generation often requires balancing output quality against inference cost, especially when using multiple generations. We introduce a new framework for inference-time optimization based on the classical Pandora's Box problem. Viewing each generation as opening a costly "box" with random reward, we develop algorithms that decide when to stop generating without knowing the underlying reward distribution. Our first contribution is a UCB-style Pandora's Box algorithm, which achieves performance that is provably close to Weitzman's algorithm, the optimal strategy when the distribution is known. We further adapt this method to practical LLM settings by addressing reward scaling across prompts via a Bradley-Terry inspired transformation. This leads to an adaptive inference-time optimization method that normalizes rewards and learns stopping thresholds on the fly. Experiments on the AlpacaFarm and HH-RLHF datasets, using multiple LLM-reward model pairs, show that our adaptive strategy can obtain the same performance as non-adaptive Best-of-N sampling while requiring 15-35 percent fewer generations on average. Our results establish a principled bridge between optimal stopping theory and inference-time scaling, providing both theoretical performance bounds and practical efficiency gains for LLM deployment.
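Abstract (reference translation): Large language model (LLM) generation often requires balancing output quality against inference cost, especially when multiple generations are used. We propose a new framework for inference-time optimization based on the Pandora's Box problem. Viewing each generation as opening a costly "box" with a random reward, we develop algorithms that decide when to stop generating without knowing the reward distribution. Our first contribution is a UCB-style Pandora's Box algorithm that achieves performance close to Weitzman's algorithm, the optimal strategy when the distribution is known. We adapt this method to practical LLM settings by addressing reward scaling across prompts through a Bradley-Terry-inspired transformation. This yields an adaptive inference-time optimization method that normalizes rewards and learns stopping thresholds on the fly. Experiments on the AlpacaFarm and HH-RLHF datasets with multiple LLM-reward model pairs show that our adaptive strategy matches the performance of non-adaptive Best-of-N sampling while requiring 15-35% fewer generations on average. This work builds a principled bridge between optimal stopping theory and inference-time scaling, providing both theoretical performance bounds and practical efficiency gains for LLM deployment.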
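
The known-distribution baseline mentioned in the abstract can be illustrated with a minimal sketch. When every "box" (generation) has the same cost and the same reward distribution, Weitzman's rule reduces to a single threshold: compute the reservation value sigma that solves E[max(X - sigma, 0)] = cost, then keep generating until the best reward seen so far reaches sigma. The code below is not the paper's UCB-style algorithm, which the abstract does not specify. It assumes the distribution is approximated by a fixed sample of rewards, and the names `reservation_value`, `generate_until_threshold`, and `draw_reward` are hypothetical.

```python
import random


def reservation_value(samples, cost, tol=1e-9):
    """Solve mean(max(x - sigma, 0)) = cost for sigma by bisection.

    `samples` stands in for the reward distribution (an assumption of this
    sketch); `cost` is the positive per-generation cost in reward units.
    """
    def expected_gain(sigma):
        return sum(max(x - sigma, 0.0) for x in samples) / len(samples)

    # expected_gain is non-increasing in sigma; it is >= cost at `lo`
    # and equals 0 at `hi`, so the root lies in [lo, hi].
    lo, hi = min(samples) - cost, max(samples)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gain(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def generate_until_threshold(draw_reward, sigma, max_draws):
    """Draw rewards until the running best reaches sigma or the budget ends.

    Returns (best_reward, number_of_draws). With sigma = +inf this is plain
    Best-of-N sampling with N = max_draws.
    """
    best = float("-inf")
    for n in range(1, max_draws + 1):
        best = max(best, draw_reward())
        if best >= sigma:
            return best, n
    return best, max_draws


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical reward model: standard normal scores, cost 0.05 per draw.
    history = [random.gauss(0.0, 1.0) for _ in range(10_000)]
    sigma = reservation_value(history, cost=0.05)
    best, used = generate_until_threshold(lambda: random.gauss(0.0, 1.0), sigma, 32)
    print(f"threshold={sigma:.3f} best={best:.3f} generations={used}")
```

Passing `float("inf")` as `sigma` recovers non-adaptive Best-of-N, which makes the two strategies easy to compare on the same reward stream. A lower per-generation cost pushes the threshold up and lengthens the search.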

Related papers