Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards
- URL: http://arxiv.org/abs/2601.22511v1
- Date: Fri, 30 Jan 2026 03:43:42 GMT
- Title: Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards
- Authors: Yuan-Jay Lü, Chengyu Wang, Lei Shen, Jun Huang, Tong Xu,
- Abstract summary: Existing open-source agentic training data are narrow in task variety and easily solved.<n>Real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes.<n>We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments.
- Score: 13.784988950752195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Small LLMs often struggle to match the agentic capabilities of large, costly models. While reinforcement learning can help, progress has been limited by two structural bottlenecks: existing open-source agentic training data are narrow in task variety and easily solved; real-world APIs lack diversity and are unstable for large-scale reinforcement learning rollout processes. We address these challenges with SYNTHAGENT, a framework that jointly synthesizes diverse tool-use training data and simulates complete environments. Specifically, a strong teacher model creates novel tasks and tool ecosystems, then rewrites them into intentionally underspecified instructions. This compels agents to actively query users for missing details. When handling synthetic tasks, an LLM-based user simulator provides user-private information, while a mock tool system delivers stable tool responses. For rewards, task-level rubrics are constructed based on required subgoals, user-agent interactions, and forbidden behaviors. Across 14 challenging datasets in math, search, and tool use, models trained on our synthetic data achieve substantial gains, with small models outperforming larger baselines.
Related papers
- Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments.<n>We propose Agent World Model (AWM), a fully synthetic environment generation pipeline.<n>We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z) - FunReason-MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling [39.45732462111156]
We present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use.<n>FunReason-MT resolves the complexity barrier in multi-turn FC data by employing Environment-API Graph Interactions.<n>A 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models.
arXiv Detail & Related papers (2025-10-28T17:15:26Z) - Advancing SLM Tool-Use Capability using Reinforcement Learning [0.0]
The ability to use tools effectively has become a defining feature of Large Language Models (LLMs), allowing them to access external data and internal resources.<n>Small Language Models (SLMs) face challenges in accurately integrating tool use, especially in resource-constrained settings.<n>This study investigates how Reinforcement Learning, specifically Group Relative Policy Optimization, can enhance the tool-use of SLMs.
arXiv Detail & Related papers (2025-09-03T07:41:14Z) - LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data.<n>We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback.<n>Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z) - SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [31.921127664873882]
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks.<n>High-quality training data is scarce, especially data that reflects real-world SWE scenarios.<n>Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks.
arXiv Detail & Related papers (2025-05-26T18:01:00Z) - RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls.<n>Existing approaches to synthetic data generation fail to replicate real-world data distributions.<n>We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv Detail & Related papers (2025-05-15T16:53:45Z) - MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
Gym-style framework for systematically reinforcement learning, evaluating, and improving autonomous large language model (LLM) agents.<n>MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios.<n>Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z) - ActionStudio: A Lightweight Framework for Data and Training of Large Action Models [88.90834854360641]
ActionStudio is a lightweight and scalable data and training framework designed for large action models.<n>Our trained models yield top performances across public and realistic agent benchmarks.<n>We open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories.
arXiv Detail & Related papers (2025-03-28T17:58:33Z) - FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response [25.400882898288216]
We introduce a dataset and pipeline to create Field Reasoning and Instruction Decoding Agent (FRIDA) models.<n>In our pipeline, domain experts and linguists combine their knowledge to make high-quality, few-shot prompts.<n>We fine-tune several small instruction-tuned models and find that ablated FRIDA models only trained on objects' physical state and function data outperformed both the FRIDA models trained on all synthetic data and the base models in our evaluation.
arXiv Detail & Related papers (2025-02-25T18:51:06Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.