Fugu-MT Paper Translation (Abstract): Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

Paper summary: Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

arxiv url: http://arxiv.org/abs/2603.16140v1
Date: Tue, 17 Mar 2026 05:48:32 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.116717
Title: Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards
Title (reference translation): Noisy Data Harms Reinforcement Learning with Verifiable Rewards
Authors: Yuxuan Zhu, Daniel Kang
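Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations. We show that these findings are invalid because the claimed 100% noisy training data is "contaminated" with clean data.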
Reference score (independently computed attention score): 9.797159765512236
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances of large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy training data is "contaminated" with clean data. After rectifying the dataset with a rigorous re-verification pipeline, we demonstrate that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to mitigate the impact of noise, achieving similar performance to that of the basic GRPO. Furthermore, we find that the model trained on truly incorrect annotations performs 8-10% worse than the model trained on clean data across mathematical reasoning benchmarks. Finally, we show that these findings hold for real-world noise in Text2SQL tasks, where training on real-world, human annotation errors causes 5-12% lower accuracy than clean data. Our results show that current RLVR methods cannot yet compensate for poor data quality. High-quality data remains essential.
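Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has driven recent capability gains in large language models across a range of domains. Recent studies suggest that improved RLVR algorithms let models learn effectively from incorrect annotations and reach performance comparable to learning from clean data. In this work, we show that these results are invalid because the training data claimed to be 100% noisy is contaminated with clean data. After correcting the dataset with a rigorous re-verification pipeline, we show that noise is destructive to RLVR. We show that existing RLVR algorithm improvements fail to reduce the impact of noise and perform about the same as basic GRPO. Furthermore, a model trained on truly incorrect annotations scores 8-10% lower on mathematical reasoning benchmarks than a model trained on clean data. Finally, we show that these results also hold for real-world noise in Text2SQL tasks, where training on real-world human annotation errors yields 5-12% lower accuracy than training on clean data. These results indicate that current RLVR methods cannot compensate for poor data quality. High-quality data remains essential.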
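
As a rough illustration of why an incorrect annotation can hurt RLVR, the sketch below computes group-normalized advantages in the style commonly associated with GRPO for one prompt, once against a clean label and once against a wrong one. It is not the authors' pipeline: the exact-match reward, the helper names `verifiable_reward` and `group_relative_advantages`, the use of the population standard deviation, and the toy answers are all assumptions made for this example.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, label: str) -> float:
    """Hypothetical exact-match verifier: 1.0 if the answer equals the label, else 0.0."""
    return 1.0 if answer.strip() == label.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled answers to a prompt whose true answer is "42".
sampled_answers = ["42", "42", "41", "42"]

# Clean annotation: the three correct samples receive positive advantage.
clean_adv = group_relative_advantages(
    [verifiable_reward(a, "42") for a in sampled_answers]
)

# Incorrect annotation ("41"): the single wrong sample is now the one reinforced.
noisy_adv = group_relative_advantages(
    [verifiable_reward(a, "41") for a in sampled_answers]
)

print("clean:", [round(a, 3) for a in clean_adv])
print("noisy:", [round(a, 3) for a in noisy_adv])
```

In this toy setting the wrong label flips the sign of every advantage in the group, so the policy update would push probability toward the incorrect answer. This only illustrates the mechanism; the 8-10% and 5-12% figures in the abstract come from the paper's experiments, not from this sketch.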

Related papers