Fugu-MT Paper Translation (Abstract): Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Paper overview: Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

arxiv url: http://arxiv.org/abs/2604.18381v1
Date: Mon, 20 Apr 2026 15:04:57 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.960867
Title: Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Authors: Justin Bauer, Thomas Walshe, Derek Pham, Harit Vishwakarma, Armin Parchami, Frederic Sala, Paroma Varma
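Abstract summary: Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or on questions with well-defined ground truth answers. Previous work has explored the benefits to model reasoning capabilities of scaling both the data and the compute used for Reinforcement Learning with Verifiable Rewards (RLVR). This work presents a comprehensive study of open-source Small Language Model (SLM) performance after RLVR in low data regimes.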
Reference score (independently computed attention score): 18.00712219143378
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
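As a rough illustration of what a procedural dataset with controllable size and complexity might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the counting task, the function names (`make_counting_example`, `make_dataset`, `reward`), and the use of list length as a stand-in for complexity are hypothetical and are not taken from the paper, whose actual generators are not described on this page.

```python
import random


def make_counting_example(rng, complexity):
    """Hypothetical counting task: longer lists stand in for higher complexity."""
    length = 5 * complexity
    numbers = [rng.randint(0, 9) for _ in range(length)]
    target = rng.randint(0, 9)
    return {
        "question": f"How many times does {target} appear in {numbers}?",
        "answer": numbers.count(target),  # ground truth computed by the generator
        "complexity": complexity,
    }


def make_dataset(size, complexities, seed=0):
    """Build `size` examples, cycling through the given complexity levels."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [
        make_counting_example(rng, complexities[i % len(complexities)])
        for i in range(size)
    ]


def reward(example, model_answer):
    """Verifiable reward: 1.0 on an exact match with the ground truth, else 0.0."""
    return 1.0 if model_answer == example["answer"] else 0.0


# An easy-only dataset and a mixed-complexity dataset of the same size.
easy = make_dataset(size=6, complexities=[1])
mixed = make_dataset(size=6, complexities=[1, 2, 3])
```

Because the generator computes each answer itself, the reward can be checked automatically, and size and complexity mix can be varied independently by changing the arguments to `make_dataset`.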


Related papers