Fugu-MT Paper Translation (Abstract): Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

Paper abstract: Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.02073v1
Date: Sun, 03 May 2026 22:01:25 GMT
Status: Translation complete
System update date: 2026-05-05 20:33:50.069481
Title: Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
Authors: Arash Ahmadi, Sarah Sharif, Yaser Banad
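Abstract summary: This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686].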
Reference score (independently computed attention score): 0.4285416351982749
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mathematical reasoning is a key benchmark for large language models. Reinforcement learning is a standard post-training mechanism for improving the reasoning capabilities of large language models, yet performance remains sensitive to the design of the reward function that drives policy optimization. This paper introduces a search-driven framework that treats the reward specification itself as an object of optimization. The setting of interest is one in which the base model is held fixed and the reward specification is the primary remaining design lever. Candidate reward functions are generated by a frontier language model, validated automatically, screened through 500-step Group Relative Policy Optimization (GRPO) training runs on a Llama-3.2-3B-Instruct base model with Low-Rank Adaptation (LoRA), and ranked by F1 on the GSM8K test set. Ranked summaries from prior rounds are then fed back into the next round of generation. Over five rounds, the search produces 50 candidate rewards. The mean F1 rises from 0.596 in Round 1 to 0.632 in Round 5, and the top individual reward reaches F1 = 0.787. Seven ensemble configurations of top-ranked rewards are evaluated. The best ensemble achieves F1 = 0.795 (95% bootstrap CI [0.756, 0.832]) and accuracy 0.660 [0.635, 0.686], a 0.19 absolute F1 gain over a base-rewards-only GRPO baseline (F1 = 0.609). Pairwise McNemar tests with Bonferroni correction show all five-or-more-reward configurations are statistically indistinguishable at α = 0.05/21. A three-seed re-training of the best ensemble yields F1 of 0.785. A randomly drawn 5-reward control collapses to F1 = 0.047, which shows that the ranked-feedback loop, not the additive signal of having more rewards, drives the gain.
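The significance threshold α = 0.05/21 in the abstract is consistent with a Bonferroni correction over all pairwise comparisons among the seven ensemble configurations, since C(7, 2) = 21. The following is a minimal sketch of that arithmetic together with a McNemar test on paired per-item outcomes. It assumes the exact (binomial) form of the test; the abstract does not say which variant the authors used. The function names and the toy correctness vectors are hypothetical and are not taken from the paper.

```python
from math import comb


def n_pairwise(k: int) -> int:
    """Number of unordered pairs among k configurations."""
    return comb(k, 2)


def discordant_counts(correct_a, correct_b):
    """Count test items where exactly one of two configurations is correct."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    return b, c


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value (assumed variant, not confirmed by the source).

    Under the null hypothesis the discordant items split as Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant items, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


N_CONFIGS = 7   # seven ensemble configurations, per the abstract
ALPHA = 0.05    # family-wise error rate, per the abstract

n_tests = n_pairwise(N_CONFIGS)   # 21 pairwise comparisons
threshold = ALPHA / n_tests       # Bonferroni-corrected per-test level

# Hypothetical toy data: per-item correctness of two configurations.
config_a = [True, True, False, True, False, True, True, False]
config_b = [True, False, False, True, True, True, True, False]

b, c = discordant_counts(config_a, config_b)
p_value = mcnemar_exact_p(b, c)
significant = p_value < threshold
```

With real data, `config_a` and `config_b` would hold one boolean per GSM8K test item for each of the two configurations being compared, and a pair would be called distinguishable only when `p_value` falls below `threshold`.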


Related papers