Fugu-MT Paper Translation (Abstract): GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Paper overview: GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

arxiv url: http://arxiv.org/abs/2605.19577v1
Date: Tue, 19 May 2026 09:21:09 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.226436
Title: GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment
Title (reference translation): GoLongRL: Capability-Oriented Long-Context Reinforcement Learning via Multitask Alignment
Authors: Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li
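Abstract summary: GoLongRL is a capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards. The authors openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Under the same vanilla GRPO setup, their dataset outperforms the closed-source QwenLong-L1.5 dataset.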
Reference score (independently computed attention score): 46.47136353104916
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.
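Abstract (reference translation): We propose GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, which leads to homogeneous task coverage and reward formulations that do not adequately reflect practical long-context requirements. Our work makes two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It includes open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset outperforms the closed-source QwenLong-L1.5 dataset. Moreover, the Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507. (2) TMN-Reweight for heterogeneous multitask optimization. To address the optimization challenges caused by heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight improves average performance over vanilla GRPO, with general capabilities preserved or improved across the reported evaluations.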
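The abstract names the two ingredients of TMN-Reweight but does not give its formulas. The Python sketch below is therefore only an illustration of the general idea, not the authors' implementation. It assumes a mean-centered group baseline as in GRPO, divides each advantage by the mean reward of its task so that tasks with differently scaled metrics contribute comparably, and leaves difficulty-adaptive weighting as a pluggable hook. The function name `task_normalized_advantages`, the `weight_fn` parameter, and the batch layout are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def task_normalized_advantages(batch, weight_fn=None, eps=1e-8):
    """Illustrative advantage computation (assumed form, not from the paper).

    batch: list of (task_name, rewards) pairs, one pair per prompt, where
           rewards holds the scores of that prompt's sampled rollouts.
    weight_fn: optional hypothetical hook mapping a group's rewards to a
               scalar weight (a stand-in for difficulty-adaptive weighting).
    """
    # Pool rewards per task to estimate each task's typical reward scale.
    pooled = defaultdict(list)
    for task, rewards in batch:
        pooled[task].extend(rewards)
    task_scale = {t: max(abs(mean(rs)), eps) for t, rs in pooled.items()}

    advantages = []
    for task, rewards in batch:
        baseline = mean(rewards)  # per-prompt group baseline, as in GRPO
        weight = 1.0 if weight_fn is None else weight_fn(rewards)
        advantages.append(
            [weight * (r - baseline) / task_scale[task] for r in rewards]
        )
    return advantages


# Two tasks whose rewards differ by a factor of ten in scale.
batch = [
    ("qa_accuracy", [0.0, 1.0, 1.0, 0.0]),
    ("summary_f1", [0.0, 0.1, 0.1, 0.0]),
]
print(task_normalized_advantages(batch))
```

In this toy batch both groups end up with advantages of the same magnitude, even though the raw rewards of the second task are ten times smaller. The sketch skips per-group standard-deviation scaling because dividing by the group standard deviation would cancel any constant rescaling of the rewards. Whether the paper makes the same choice is not stated in the abstract.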

Related papers