Fugu-MT Paper Translation (Abstract): ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

Paper overview: ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding

arxiv url: http://arxiv.org/abs/2508.19576v2
Date: Mon, 08 Sep 2025 13:12:19 GMT
Status: Translation complete
Last updated in system: 2025-09-09 14:07:03.32185
Title: ReST-RL: Achieving Accurate Code Reasoning of LLMs with Optimized Self-Training and Decoding
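Title (reference translation): ReST-RL: Achieving accurate code reasoning of LLMs through optimized self-training and decoding
Authors: Sining Zhoubian, Dan Zhang, Jie Tang
Abstract summary: This paper introduces ReST-RL, a unified LLM RL paradigm. It combines an improved GRPO algorithm with a carefully designed test-time decoding method assisted by a value model (VM). Extensive experiments on coding problems were conducted to verify the validity of the proposed RL paradigm.
Reference score (attention measure computed by this site): 15.051729280454454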
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With respect to improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO faces failure due to insignificant reward variance, while verification methods based on process reward models (PRMs) suffer from difficulties with training data acquisition and verification effectiveness. To tackle these problems, this paper introduces ReST-RL, a unified LLM RL paradigm that significantly improves LLM's code reasoning ability by combining an improved GRPO algorithm with a meticulously designed test time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, increasing the reward variance of GRPO sampling, thus improving the effectiveness and efficiency of training. After the basic reasoning ability of LLM policy has been improved, we further propose a test time decoding optimization method called VM-MCTS. Through Monte-Carlo Tree Search (MCTS), we collect accurate value targets with no annotation required, on which VM training is based. When decoding, the VM is deployed by an adapted MCTS algorithm to provide precise process signals as well as verification scores, assisting the LLM policy to achieve high reasoning accuracy. We conduct extensive experiments on coding problems to verify the validity of the proposed RL paradigm. Upon comparison, our approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO), as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS) on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. Codes for our project can be found at https://github.com/THUDM/ReST-RL.
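Abstract (reference translation): For improving the reasoning accuracy of LLMs, the representative reinforcement learning (RL) method GRPO fails when reward variance is insignificant, while verification methods based on process reward models (PRMs) suffer from difficulties in acquiring training data and from limited verification effectiveness. To address these problems, the paper proposes ReST-RL, a unified LLM RL paradigm that significantly improves the code reasoning ability of LLMs by combining an improved GRPO algorithm with a carefully designed test-time decoding method assisted by a value model (VM). As the first stage of policy reinforcement, ReST-GRPO adopts an optimized ReST algorithm to filter and assemble high-value training data, which increases the reward variance of GRPO sampling and thereby improves the effectiveness and efficiency of training. After the basic reasoning ability of the LLM policy has been improved, the authors further propose a test-time decoding optimization method called VM-MCTS. Through Monte-Carlo Tree Search (MCTS), accurate value targets are collected without any annotation, and VM training is based on these targets. At decoding time, the VM is deployed by an adapted MCTS algorithm to provide precise process signals and verification scores, helping the LLM policy achieve high reasoning accuracy. Extensive experiments on coding problems verify the validity of the proposed RL paradigm. In comparison, the approach significantly outperforms other reinforcement training baselines (e.g., naive GRPO and ReST-DPO) as well as decoding and verification baselines (e.g., PRM-BoN and ORM-MCTS) on well-known coding benchmarks of various levels (e.g., APPS, BigCodeBench, and HumanEval), indicating its power to strengthen the reasoning ability of LLM policies. The project code is available at https://github.com/THUDM/ReST-RL.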
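
Illustrative note: the abstract's point about insignificant reward variance can be made concrete with a small sketch. GRPO-style training standardizes rewards within each sampled group, so a group whose rewards are all identical yields zero advantages and no learning signal. The code below is only an illustration under stated assumptions, not the authors' ReST-GRPO procedure: the function names, the `min_std` threshold, and the sample rewards are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_by_reward_std(groups, min_std=0.1):
    """Keep only prompts whose sampled rewards vary enough to give a signal.

    The threshold is a hypothetical choice made for this illustration.
    """
    return {p: rs for p, rs in groups.items() if pstdev(rs) >= min_std}


# Hypothetical rewards for four sampled solutions per prompt.
sampled = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # every sample passes: no variance
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # every sample fails: no variance
    "prompt_c": [1.0, 0.0, 0.5, 0.0],  # mixed outcomes: useful signal
}

flat = group_relative_advantages(sampled["prompt_a"])
mixed = group_relative_advantages(sampled["prompt_c"])
kept = filter_by_reward_std(sampled)
```

Here `flat` contains only zeros, while `mixed` assigns a positive advantage to the passing sample and negative advantages to the failing ones. Only `prompt_c` survives the filter, which is the intuition behind selecting training data that raises reward variance.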

Related papers