Fugu-MT Paper Translation (Abstract): GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Paper overview: GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

arxiv url: http://arxiv.org/abs/2605.15464v1
Date: Thu, 14 May 2026 23:05:23 GMT
Status: Translation complete
Last updated in system: 2026-05-18 21:22:26.121609
Title: GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
Title (reference translation): GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
Authors: Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi
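Abstract summary: Post-training has become a crucial step for unlocking the capabilities of large language models. This work studies the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigates whether the conversational abilities it acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation.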
Reference score (independently computed attention level): 15.236247092411164
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation, namely GRLO. Specifically, on Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about $46\times$ less data and $68\times$ less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.
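Abstract (reference translation): Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recently, RL-based post-training has split into two paradigms: reinforcement learning from human feedback (RLHF), and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. This work studies the generalization ability of RLHF learned from a small set of interactions in open-ended environments, and examines whether the conversational abilities it acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation; the approach is called GRLO. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with 5K prompts and 22.7 GPU hours. The resulting model is competitive with Qwen's released post-trained models. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. The code and data will be available at: https://github.com/SJY8460/GRLO.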
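
The efficiency figures quoted in the abstract can be turned into rough absolute numbers for the in-domain RLVR baseline. The sketch below is only back-of-the-envelope arithmetic: it assumes "about 46x less data" and "about 68x less compute" are simple multiplicative ratios against GRLO's 5K prompts and 22.7 GPU hours, so the results are approximations implied by the abstract, not values reported by the authors. The function and variable names are illustrative.

```python
# Figures quoted in the abstract (Qwen3-4B-Base backbone).
grlo_prompts = 5_000        # "5K prompts"
grlo_gpu_hours = 22.7       # "22.7 GPU hours"
data_ratio = 46             # "about 46x less data" (approximate)
compute_ratio = 68          # "about 68x less compute" (approximate)
score_before = 24.1         # average performance of the base model
score_after = 63.1          # average performance after GRLO


def implied_baseline(grlo_value: float, ratio: float) -> float:
    """Back out an approximate baseline cost from an 'N times less' claim."""
    return grlo_value * ratio


# Approximate cost of the in-domain RLVR baseline implied by the ratios.
baseline_prompts = implied_baseline(grlo_prompts, data_ratio)
baseline_gpu_hours = implied_baseline(grlo_gpu_hours, compute_ratio)

# Absolute improvement in average score reported for GRLO.
gain = round(score_after - score_before, 1)

print(f"Implied baseline prompts:   ~{baseline_prompts:,.0f}")
print(f"Implied baseline GPU hours: ~{baseline_gpu_hours:,.1f}")
print(f"Average score gain:         {gain} points")
```

Because the abstract states both ratios only approximately, the implied baseline values should be read as orders of magnitude (roughly 230K prompts and on the order of 1,500 GPU hours) rather than exact costs.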


Related papers