Fugu-MT Paper Translation (Summary): LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

Paper summary: LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

arxiv url: http://arxiv.org/abs/2510.19363v2
Date: Mon, 27 Oct 2025 01:55:12 GMT
Status: Translation complete
Last updated in system: 2025-10-28 13:14:10.609263
Title: LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Title (reference translation): LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Authors: Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
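Abstract summary: Introduces LoongRL, a data-driven RL method for advanced long-context reasoning. KeyChain is a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks. On Qwen2.5-7B and 14B, LoongRL improves long-context multi-hop QA accuracy by +23.5% and +21.1% (absolute).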
Reference score (independently computed attention metric): 21.07202581368365
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
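Abstract (reference translation): Reasoning over long contexts is essential for large language models. Reinforcement learning (RL) strengthens short-context reasoning by inducing "Aha" moments in chain-of-thought, but the advanced thinking patterns needed for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. This paper introduces LoongRL, a data-driven RL method for advanced long-context reasoning. At the core of LoongRL is KeyChain, which converts short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among a large number of distracting documents. To solve these tasks, the model must trace the correct chain step by step, identify the true question, retrieve the relevant facts, and reason over them to reach the correct answer. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond the training length. Models trained at 16K effectively solve 128K tasks without incurring the prohibitive cost of full-length RL rollouts. On Qwen2.5-7B and 14B, LoongRL improves long-context multi-hop QA accuracy by +23.5% and +21.1% (absolute). The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.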
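
The abstract describes KeyChain only at a high level, so the following is a minimal Python sketch of the idea, not the authors' implementation. It assumes a hypothetical format in which each UUID key maps either to the next key or, at the end of a chain, to a question. The function names (`make_chain`, `build_keychain_task`, `trace_chain`), the JSON-like pair layout, and the prompt wording are all assumptions made for illustration.

```python
import random
import uuid


def make_chain(question, hops, rng):
    """Build a linked chain of UUID keys whose last key maps to `question`.

    Hypothetical format: returns (start_key, [(key, value), ...]).
    """
    keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(hops)]
    pairs = [(keys[i], keys[i + 1]) for i in range(hops - 1)]
    pairs.append((keys[-1], question))
    return keys[0], pairs


def build_keychain_task(true_q, distractor_qs, documents, hops=3, seed=0):
    """Hide `true_q` behind one chain and mix it with distractor chains and documents."""
    rng = random.Random(seed)
    start_key, pairs = make_chain(true_q, hops, rng)
    for q in distractor_qs:
        # Distractor chains end in other questions; their start keys are never revealed.
        _, extra = make_chain(q, hops, rng)
        pairs.extend(extra)
    # Scatter the key-value pairs among the documents (assumed layout).
    segments = list(documents) + [f'{{"{k}": "{v}"}}' for k, v in pairs]
    rng.shuffle(segments)
    context = "\n\n".join(segments)
    prompt = (
        f"{context}\n\n"
        f"Follow the chain starting at key {start_key} to find the question, then answer it."
    )
    return prompt, start_key, dict(pairs)


def trace_chain(start_key, table):
    """Reference solver: follow key -> value links until the value is not a key."""
    current = start_key
    while current in table:
        current = table[current]
    return current
```

In this sketch, `trace_chain` plays the role of the step-by-step chain tracing that the abstract says the model must perform before it can retrieve facts and reason over them. A real pipeline would also need the answer-checking reward used for RL, which the abstract does not specify.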

Related papers