Fugu-MT paper translation (abstract): SPARK: Synergistic Policy And Reward Co-Evolving Framework

Paper overview: SPARK: Synergistic Policy And Reward Co-Evolving Framework

arxiv url: http://arxiv.org/abs/2509.22624v1
Date: Fri, 26 Sep 2025 17:50:12 GMT
Status: translation complete
Last updated in system: 2025-09-29 20:57:54.625396
Title: SPARK: Synergistic Policy And Reward Co-Evolving Framework
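Title (reference translation): SPARK: Synergistic Policy And Reward Co-Evolving Framework
Authors: Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
Abstract summary: We introduce SPARK (Synergistic Policy And Reward Co-Evolving Framework), an efficient, on-policy, and stable method built on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to train the model itself as a generative reward model. We show that SPARK achieves significant performance gains across multiple LLM and LVLM models and across multiple reasoning, reward-model, and general benchmarks.
Reference score (independently computed attention measure): 84.22494672256894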
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward score, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLM and LVLM models and multiple reasoning, reward models, and general benchmarks. For example, SPARK-VL-7B achieves an average 9.7% gain on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
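Abstract (reference translation): Recent LLMs (Large Language Models) and LVLMs (Large Vision-Language Models) increasingly adopt Reinforcement Learning (RL), such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, because RLHF relies on human preferences, it incurs high costs and a potential reward-policy mismatch, while RLVR wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce SPARK (Synergistic Policy And Reward Co-Evolving Framework), an efficient, on-policy, and stable method built on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward scores, pairwise comparisons, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. The process eliminates the need for a separate reward model and for costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. The unified framework supports test-time scaling via self-reflection, without external reward models and their associated costs. We show that SPARK achieves significant performance gains across multiple LLM and LVLM models and across multiple reasoning, reward-model, and general benchmarks. For example, SPARK-VL-7B achieves an average gain over the baselines of 9.7% on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks, demonstrating robustness and broad generalization.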
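
The recycling step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Rollout` record, the function names, and the sample dictionaries are all hypothetical, chosen only to show how rollouts and verifier correctness signals from RLVR could be turned into pointwise and pairwise reward-model training examples. The reflection-conditioned objective is left out because the abstract does not specify its form.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Rollout:
    """Hypothetical record of one RLVR rollout (not from the paper)."""
    prompt: str
    response: str
    correct: bool  # outcome of the verifiable-reward check


def pointwise_samples(rollouts):
    """Build one judging example per rollout: was this response correct?"""
    return [
        {
            "prompt": r.prompt,
            "response": r.response,
            "label": "yes" if r.correct else "no",
        }
        for r in rollouts
    ]


def pairwise_samples(rollouts):
    """Pair every correct response with every incorrect one for the same prompt."""
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r.prompt, []).append(r)

    samples = []
    for prompt, group in by_prompt.items():
        good = [r for r in group if r.correct]
        bad = [r for r in group if not r.correct]
        for g, b in product(good, bad):
            samples.append(
                {"prompt": prompt, "chosen": g.response, "rejected": b.response}
            )
    return samples


# Usage: four rollouts that would otherwise be discarded after the policy update.
rollouts = [
    Rollout("2 + 2 = ?", "4", True),
    Rollout("2 + 2 = ?", "5", False),
    Rollout("2 + 2 = ?", "22", False),
    Rollout("3 * 3 = ?", "9", True),
]
pointwise = pointwise_samples(rollouts)
pairwise = pairwise_samples(rollouts)
```

In this sketch, prompts whose rollouts are all correct or all incorrect contribute pointwise examples only, since no chosen/rejected pair can be formed from them.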

Related papers