Fugu-MT Paper Translation (Abstract): Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning

Paper summary: Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.18639v1
Date: Sun, 19 Apr 2026 08:02:31 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.377838
Title: Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
Title (reference translation): Self-Evolving LLMs via Data-Efficient Reinforcement Learning
Authors: Zhiyin Yu, Bo Zhang, Qibin Hou, Zhonghai Wu, Xiao Luo, Lei Bai,
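Abstract summary: This paper introduces a new perspective inspired by cognitive learning theory and proposes a novel approach called EasyRL. EasyRL simulates the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data. Experimental results on mathematical and scientific benchmarks show that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
Reference score (independently computed attention measure): 59.25637976883812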
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Previous LLM-based RL studies typically follow either supervised learning with high annotation costs, or unsupervised paradigms using voting or entropy-based rewards. However, their performance remains far from satisfactory due to the substantial annotation cost and issues such as model collapse or reward hacking. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we initialize a warm-up model using supervised RL with few-shot labeled data. This is followed by a divide-and-conquer pseudo-labeling strategy on difficult unlabeled data, combining consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model's reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks demonstrate that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
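Abstract (reference translation): Previous LLM-based RL studies typically follow either supervised learning, which incurs high annotation costs, or unsupervised paradigms that use voting or entropy-based rewards. However, their performance remains unsatisfactory because of the substantial annotation cost and problems such as model collapse and reward hacking. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we initialize a warm-up model using supervised RL with a small amount of labeled data. This is followed by a divide-and-conquer pseudo-labeling strategy on difficult unlabeled data, which combines consistency-based selection for low-uncertainty cases with reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model's reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks show that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.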
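Illustrative sketch (not from the paper): the abstract says that unlabeled samples are routed by uncertainty, with consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases, but it does not say how uncertainty is measured. The Python snippet below assumes one plausible reading, in which uncertainty is judged by how strongly several sampled answers agree. The function name `route_by_agreement` and the thresholds `low_thr` and `mid_thr` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def route_by_agreement(answers, low_thr=0.8, mid_thr=0.5):
    """Route one unlabeled sample by agreement among sampled answers.

    ASSUMPTION: agreement ratio stands in for (inverse) uncertainty; the
    abstract does not specify the measure, and both thresholds are made up.

    Returns (bucket, candidate_label, agreement).
    """
    if not answers:
        raise ValueError("answers must be non-empty")

    # Most frequent answer and the fraction of samples that produced it.
    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / len(answers)

    if agreement >= low_thr:
        # Low uncertainty: accept the majority answer as a pseudo-label.
        return "low", top_answer, agreement
    if agreement >= mid_thr:
        # Medium uncertainty: keep the candidate, but hand it to a
        # separate reflection step (not implemented here).
        return "medium", top_answer, agreement
    # High uncertainty: no pseudo-label; defer to a later round.
    return "high", None, agreement
```

In this sketch, a sample whose ten sampled answers contain nine identical ones lands in the "low" bucket and receives the majority answer as its pseudo-label, while a sample with no clear majority is left unlabeled for a later, harder round.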

Related papers