Synthetic Data RL: Task Definition Is All You Need
- URL: http://arxiv.org/abs/2505.17063v1
- Date: Sun, 18 May 2025 05:35:13 GMT
- Title: Synthetic Data RL: Task Definition Is All You Need
- Authors: Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen,
- Abstract summary: Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts question difficulty based on model solvability, and selects questions for RL training using the model's average pass rate across samples.
- Score: 27.637296092731322
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts question difficulty based on model solvability, and selects questions for RL training using the model's average pass rate across samples. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law), and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves GSM8K performance by only 0.4 pp, showing limited added value. By reducing the need for human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
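The abstract's pipeline (generate question–answer pairs, gauge solvability, then keep questions by average pass rate before RL training) can be pictured with a minimal sketch. The function names, the exact-match check, and the pass-rate thresholds below are illustrative assumptions, not the authors' implementation or API.

```python
# Minimal sketch of pass-rate-based question selection for RL training,
# assuming a generic text-in/text-out model callable. Thresholds and the
# exact-match correctness check are hypothetical choices for illustration.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str  # reference answer synthesized alongside the question


def estimate_pass_rate(model: Callable[[str], str], qa: QAPair, n_samples: int = 8) -> float:
    """Average pass rate: fraction of sampled model answers matching the reference."""
    hits = sum(model(qa.question).strip() == qa.answer.strip() for _ in range(n_samples))
    return hits / n_samples


def select_for_rl(
    model: Callable[[str], str],
    candidates: List[QAPair],
    low: float = 0.1,
    high: float = 0.9,
) -> List[QAPair]:
    """Keep questions that are neither trivially easy nor currently unsolvable
    for the model; mid-difficulty items are the ones assumed to drive RL updates."""
    return [qa for qa in candidates if low <= estimate_pass_rate(model, qa) <= high]
```

Questions falling outside the kept band would, per the abstract, be fed back into a difficulty-adaptation step (easier or harder variants regenerated) rather than discarded outright.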
Related papers
- Reinforcement Learning Fine-Tunes a Sparse Subnetwork in Large Language Models [0.0]
It is often assumed that reinforcement learning (RL) fine-tuning requires updating most of a model's parameters; in practice, updates concentrate in a small subnetwork, a phenomenon we call RL-induced parameter update sparsity. We show that fine-tuning only this sparse subnetwork recovers full model performance and yields parameters nearly identical to the fully fine-tuned model.
arXiv Detail & Related papers (2025-07-23T01:02:17Z) - Intention-Conditioned Flow Occupancy Models [69.79049994662591]
Large-scale pre-training has fundamentally changed how machine learning research is done today. Applying this same framework to reinforcement learning is appealing because it offers compelling avenues for addressing core challenges in RL. Recent advances in generative AI have provided new tools for modeling highly complex distributions.
arXiv Detail & Related papers (2025-06-10T15:27:46Z) - AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models. We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use [38.51141014728375]
Reinforcement learning has been shown to improve the performance of large language models. We propose a synthetic data generation and RL methodology targeting multi-step optimization scenarios. Step-Wise Reinforcement Learning (SWiRL) iteratively generates multi-step reasoning and tool use data, and then learns from that data.
arXiv Detail & Related papers (2025-04-07T05:20:58Z) - LIMR: Less is More for RL Scaling [25.477841726836836]
We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models.
arXiv Detail & Related papers (2025-02-17T15:13:29Z) - Value-Based Deep RL Scales Predictably [100.21834069400023]
We show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym.
arXiv Detail & Related papers (2025-02-06T18:59:47Z) - Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining [49.730897226510095]
We introduce JOWA: Jointly-Optimized World-Action model, an offline model-based RL agent pretrained on Atari games with 6 billion tokens of data. Our largest agent, with 150 million parameters, achieves 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on average.
arXiv Detail & Related papers (2024-10-01T10:25:03Z) - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z) - Training Question Answering Models From Synthetic Data [26.91650323300262]
This work aims to narrow the gap between synthetic and human-generated question-answer pairs.
We synthesize questions and answers from a synthetic corpus generated by an 8.3 billion parameter GPT-2 model.
With no access to human supervision and only access to other models, we are able to train state of the art question answering networks on entirely model-generated data.
arXiv Detail & Related papers (2020-02-22T01:49:27Z)