Fugu-MT Paper Translation (Abstract): Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling

Paper overview: Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling

arxiv url: http://arxiv.org/abs/2510.05825v1
Date: Tue, 07 Oct 2025 11:48:32 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.230774
Title: Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling
Title (reference translation): Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling
Authors: Giorgio Giannone, Guangxuan Xu, Nikhil Shivakumar Nayak, Rohan Mahesh Awhad, Shivchander Sudalairaj, Kai Xu, Akash Srivastava
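Abstract summary: Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks. It is vulnerable when guided by process reward models, which often assign overconfident scores early in the reasoning process. The resulting failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. The paper introduces Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to address these issues.
Reference score (independently computed attention metric): 15.828750560145751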
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks, but it is vulnerable when guided by process reward models, which often assign overconfident scores early in the reasoning process. This causes PF to suffer from premature exploitation: it myopically commits to locally promising trajectories, prunes potentially correct hypotheses, and converges to suboptimal solutions. This failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. To address this, we analyze the problem and identify two root causes: a lack of diversity in the particle set due to overconfident resampling and a consequent inability to assess the potential of a reasoning path. We introduce Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to solve these issues. The first technique, Entropic Annealing (EA), directly mitigates particle impoverishment by monitoring search diversity via entropy; when diversity drops, it intervenes by dynamically annealing the resampling distribution to preserve exploration. The second, an enhancement called Look-ahead Modulation (LaM), adds a predictive guide to evaluate a state's potential based on its successors. On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50% relative improvement in task reward. Together, these methods improve PF's resilience by balancing the exploration of diverse solution spaces with the exploitation of high-reward regions, ultimately leading to higher-quality solutions.
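Abstract (reference translation): Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks, but it is vulnerable when guided by process reward models. PF myopically commits to locally promising trajectories, prunes hypotheses that may be correct, and converges to suboptimal solutions. This failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. To address it, the authors analyze the problem and identify two root causes: a lack of diversity in the particle set caused by overconfident resampling, and a consequent inability to assess the potential of a reasoning path. They introduce Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to solve these issues. The first, Entropic Annealing (EA), directly mitigates particle impoverishment by monitoring search diversity via entropy; when diversity drops, it dynamically anneals the resampling distribution to preserve exploration. The second, an enhancement called Look-ahead Modulation (LaM), adds a predictive guide that evaluates a state's potential based on its successors. On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50% relative improvement in task reward. Together, these methods improve PF's resilience by balancing the exploration of diverse solution spaces with the exploitation of high-reward regions, ultimately leading to higher-quality solutions.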
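
The entropy-monitoring idea behind Entropic Annealing can be illustrated with a small sketch. The abstract does not give the paper's actual update rule, so everything concrete here is an assumption made for illustration: the softmax weighting of reward-model scores, the normalized-entropy threshold `min_entropy`, the multiplicative temperature schedule `step`, and all function names. It shows one plausible way to flatten a resampling distribution when diversity drops, not the authors' implementation.

```python
import math
import random


def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution at a given temperature."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(N), so the result lies in [0, 1]."""
    n = len(probs)
    if n < 2:
        return 1.0  # a single particle cannot be made more diverse
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def annealed_resampling_weights(scores, min_entropy=0.5, step=1.5,
                                max_temperature=64.0):
    """Raise the temperature until the resampling distribution is diverse enough.

    Hypothetical rule: `min_entropy`, `step`, and `max_temperature` are
    illustrative knobs, not values taken from the paper.
    """
    temperature = 1.0
    probs = softmax(scores, temperature)
    while normalized_entropy(probs) < min_entropy and temperature < max_temperature:
        temperature *= step
        probs = softmax(scores, temperature)
    return probs, temperature


def resample(particles, probs, rng=random):
    """Multinomial resampling: draw len(particles) particles with replacement."""
    return rng.choices(particles, weights=probs, k=len(particles))


if __name__ == "__main__":
    # One particle has an overconfident early score; the others are tied.
    scores = [10.0, 0.0, 0.0, 0.0]
    probs, temperature = annealed_resampling_weights(scores)
    print("temperature used:", temperature)
    print("resampled:", resample(["a", "b", "c", "d"], probs))
```

With the default temperature of 1.0, nearly all resampling mass would fall on the first particle; the loop raises the temperature only while the normalized entropy stays below the threshold, so well-spread score sets are left untouched.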

Related papers