EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
- URL: http://arxiv.org/abs/2601.05808v1
- Date: Fri, 09 Jan 2026 14:32:06 GMT
- Title: EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
- Authors: Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
- Abstract summary: Large language models (LLMs) are expected to be trained to act as agents in various real-world environments. This process relies on rich and varied tool-interaction sandboxes. We propose EnvScaler, an automated framework for scalable tool-interaction environments.
- Score: 101.67583081810136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.
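The abstract describes ScenGenerator as producing rule-based trajectory validation functions for each scenario. The sketch below is a hypothetical illustration of what such a validator might look like; the function name, trajectory schema, and rules are assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a rule-based trajectory validation function in the
# spirit of EnvScaler's ScenGenerator. The schema (a trajectory as a list of
# step dicts) and both rules are illustrative assumptions.

def validate_trajectory(trajectory, required_calls, final_state, expected_state):
    """Check that an agent trajectory invoked the required tools in the
    given relative order and left the environment in the expected state."""
    calls = [step["tool"] for step in trajectory if step.get("type") == "tool_call"]

    # Rule 1: every required tool call must appear, preserving relative order.
    idx = 0
    for call in calls:
        if idx < len(required_calls) and call == required_calls[idx]:
            idx += 1
    if idx < len(required_calls):
        return False

    # Rule 2: the final environment state must match the expected outcome.
    return all(final_state.get(k) == v for k, v in expected_state.items())


# Example: a flight-booking scenario (tool names are invented for illustration).
trajectory = [
    {"type": "tool_call", "tool": "search_flights"},
    {"type": "message", "text": "Found a matching flight."},
    {"type": "tool_call", "tool": "book_flight"},
]
ok = validate_trajectory(
    trajectory,
    required_calls=["search_flights", "book_flight"],
    final_state={"booking_confirmed": True},
    expected_state={"booking_confirmed": True},
)
```

Because such checks are deterministic, they can serve both as SFT data filters and as verifiable reward signals for RL, which matches how the abstract applies the synthesized scenarios to both training stages.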
Related papers
- SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks [10.106518618464888]
SWE-Hub is an end-to-end system that operationalizes the data factory abstraction. It unifies environment automation, scalable synthesis, and diverse task generation into a coherent production stack.
arXiv Detail & Related papers (2026-02-28T09:53:48Z) - Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z) - VirtualEnv: A Platform for Embodied AI Research [26.527818430035534]
We present VirtualEnv, a next-generation simulation platform built on Unreal Engine 5. It enables fine-grained benchmarking of large language models (LLMs) in embodied and interactive scenarios. We provide a user-friendly API built on top of Unreal Engine, allowing researchers to deploy and control LLM-driven agents.
arXiv Detail & Related papers (2026-01-12T14:04:38Z) - Simulating Environments with Reasoning Models for Agent Training [55.98861707136674]
Building bespoke environments for training is heavy, brittle, and limits progress. We propose two frameworks: Simia-SFT and Simia-RL. Together, they enable scalable agent training without environment engineering.
arXiv Detail & Related papers (2025-11-03T18:29:57Z) - PIPer: On-Device Environment Setup via Online Reinforcement Learning [74.52354321028493]
Automated environment setup methods could assist developers by providing fully configured environments for arbitrary repositories without manual effort. Recent studies reveal that even state-of-the-art Large Language Models (LLMs) achieve limited success in automating this task. We combine supervised fine-tuning for generating correct scripts with Reinforcement Learning with Verifiable Rewards (RLVR) to adapt the model to the task of environment setup. On EnvBench-Python, our method enables Qwen3-8B (a model runnable on consumer hardware) to perform on par with larger models such as Qwen3-32B and GPT-4.
arXiv Detail & Related papers (2025-09-29T20:03:05Z) - Generalizable End-to-End Tool-Use RL with Synthetic CodeGym [52.31172214690965]
We introduce CodeGym, a framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL. CodeGym rewrites static coding problems into interactive environments by extracting atomic functions or logic into callable tools. Models of varying sizes and chain-of-thought configurations, trained in CodeGym, exhibit consistent out-of-distribution generalizability.
arXiv Detail & Related papers (2025-09-22T03:03:56Z) - One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning [32.13266149565313]
Multi-task world models like UniZero excel in single-task settings. We find that gradient conflicts and the loss of model plasticity often constrain their sample efficiency. In this work, we address these challenges from two complementary perspectives: the single learning iteration and the overall learning process.
arXiv Detail & Related papers (2025-09-09T17:27:53Z) - SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving [90.32201622392137]
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs). Unlike traditional static benchmarks, SwingArena models the collaborative process of software development by pairing LLMs as submitters, who generate patches, and reviewers, who create test cases and verify the patches through continuous integration (CI) pipelines.
arXiv Detail & Related papers (2025-05-29T18:28:02Z) - SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents [31.921127664873882]
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, high-quality training data is scarce, especially data that reflects real-world SWE scenarios. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks.
arXiv Detail & Related papers (2025-05-26T18:01:00Z) - The Compressor-Retriever Architecture for Language Model OS [20.56093501980724]
This paper explores the concept of using a language model as the core component of an operating system (OS).
A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions.
We introduce compressor-retriever, a model-agnostic architecture designed for life-long context management.
arXiv Detail & Related papers (2024-09-02T23:28:15Z) - CoRL: Environment Creation and Management Focused on System Integration [0.0]
The Core Reinforcement Learning library (CoRL) is a modular, composable, and hyper-configurable environment creation tool.
It allows minute control over agent observations, rewards, and done conditions through the use of easy-to-read configuration files, pydantic validators, and a functor design pattern.
arXiv Detail & Related papers (2023-03-03T19:01:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.