EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
- URL: http://arxiv.org/abs/2602.16179v4
- Date: Mon, 23 Feb 2026 06:33:42 GMT
- Title: EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
- Authors: Sushant Mehta, Logan Ritchie, Suhaas Garre, Ian Niebres, Nick Heiner, Edwin Chen
- Abstract summary: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments.
- Score: 0.10934862523101825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from a 25.37% to a 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). We attribute the observed transfer to three environment properties: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
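The training recipe the abstract describes, all-or-nothing rubric rewards fed into GRPO with an adaptive clip, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the asymmetric clip bounds, helper names, and thresholds are assumptions.

```python
# Illustrative sketch of GRPO-style training signals with rubric-based
# binary rewards and an asymmetric ("adaptive") clip. Names, defaults,
# and the exact clipping rule are hypothetical, not from the paper.
from typing import List


def rubric_reward(criteria_results: List[bool]) -> float:
    """All-or-nothing reward: 1.0 only when every expert-authored
    rubric criterion is satisfied, mirroring the strict pass rate
    used to evaluate CoreCraft tasks."""
    return 1.0 if all(criteria_results) else 0.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO advantage: normalize each rollout's reward against the
    mean and standard deviation of its group of rollouts sampled
    for the same task (no learned value function)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:  # all rollouts tied: zero learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped objective with asymmetric clip bounds
    (eps_high > eps_low loosens the upper clip); one plausible
    reading of "adaptive clipping", assumed here for illustration."""
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    return min(ratio * advantage, clipped * advantage)
```

Because rewards are binary per rollout, the group normalization is what produces a usable gradient signal: passing rollouts in a mixed group get positive advantages, failing ones negative, and uniformly passing or failing groups contribute nothing.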
Related papers
- Hybrid-Gym: Training Coding Agents to Generalize Across Tasks [59.95803522351185]
In this paper, we describe transferable skills that are shared across diverse tasks. We propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks. Experiments show that agents trained on our synthetic tasks generalize effectively to diverse real-world tasks.
arXiv Detail & Related papers (2026-02-18T19:30:55Z) - CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments [1.6153514666902042]
Real organizational work requires managing many concurrent long-horizon tasks with interleaving, dependencies, and reprioritization. We introduce Multi-Horizon Task Environments (MHTEs): a distinct problem class requiring coherent execution across dozens of interleaved tasks. We identify four failure modes that cause baseline CUAs to degrade from 16.7% to 8.7% completion as load scales from 25% to 100%. We present CorpGen, an architecture-agnostic framework addressing these failures via hierarchical planning for multi-horizon goal alignment.
arXiv Detail & Related papers (2026-02-15T16:54:34Z) - Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation [57.65688895630163]
We introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. Our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without forgetting existing environments.
arXiv Detail & Related papers (2026-02-10T23:06:02Z) - Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z) - Endless Terminals: Scaling RL Environments for Terminal Agents [39.60665149203152]
Endless Terminals is a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop. These improvements transfer to human-curated benchmarks.
arXiv Detail & Related papers (2026-01-23T04:39:55Z) - AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts [35.52607495764441]
Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. We introduce AgencyBench, a benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve.
arXiv Detail & Related papers (2026-01-16T07:22:20Z) - Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem [90.17610617854247]
We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME, an open-source agent grounded by ALE and trained on over one million trajectories.
arXiv Detail & Related papers (2025-12-31T14:03:39Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay among agent quantity, coordination structure, model capability, and task properties. We derive a predictive model using coordination metrics, cross-validated to enable prediction on unseen task domains. Among the effects we identify: (1) a tool-coordination trade-off, where under fixed computational budgets tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) capability saturation, where coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL). We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% higher than the open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z) - AWorld: Orchestrating the Training Recipe for Agentic AI [35.94278765364194]
We introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. We trained a Qwen3-32B-based agent that achieves pass@1 accuracy of 32.23% on the GAIA test set.
arXiv Detail & Related papers (2025-08-28T04:04:30Z) - Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models [33.1538965735133]
Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions. We construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
arXiv Detail & Related papers (2024-08-15T17:23:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.