EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
- URL: http://arxiv.org/abs/2602.16179v4
- Date: Mon, 23 Feb 2026 06:33:42 GMT
- Title: EnterpriseBench Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
- Authors: Sushant Mehta, Logan Ritchie, Suhaas Garre, Ian Niebres, Nick Heiner, Edwin Chen
- Abstract summary: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments.
- Score: 0.10934862523101825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce CoreCraft, the first environment in EnterpriseBench, Surge AI's suite of agentic RL environments. CoreCraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from a 25.37% to a 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on Tau2-Bench Retail, and +6.8% on Tool Decathlon (Pass@1). We attribute the observed transfer to three environment properties: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
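The training recipe the abstract describes, all-or-nothing rubric rewards fed into GRPO with an adaptive clip, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the asymmetric clip bounds, helper names, and thresholds are assumptions.

```python
# Illustrative sketch of GRPO-style training signals with rubric-based
# binary rewards and an asymmetric ("adaptive") clip. Names, defaults,
# and the exact clipping rule are hypothetical, not from the paper.
from typing import List


def rubric_reward(criteria_results: List[bool]) -> float:
    """All-or-nothing reward: 1.0 only when every expert-authored
    rubric criterion is satisfied, mirroring the strict pass rate
    used to evaluate CoreCraft tasks."""
    return 1.0 if all(criteria_results) else 0.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO advantage: normalize each rollout's reward against the
    mean and standard deviation of its group of rollouts sampled
    for the same task (no learned value function)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:  # all rollouts tied: zero learning signal
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped objective with asymmetric clip bounds
    (eps_high > eps_low loosens the upper clip); one plausible
    reading of "adaptive clipping", assumed here for illustration."""
    clipped = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    return min(ratio * advantage, clipped * advantage)
```

Because rewards are binary per rollout, the group normalization is what produces a usable gradient signal: passing rollouts in a mixed group get positive advantages, failing ones negative, and uniformly passing or failing groups contribute nothing.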
Related papers
- Hybrid-Gym: Training Coding Agents to Generalize Across Tasks [59.95803522351185]
In this paper, we describe transferable skills that are shared across diverse tasks. We propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks. Experiments show that agents trained on our synthetic tasks generalize effectively to diverse real-world tasks.
arXiv Detail & Related papers (2026-02-18T19:30:55Z) - CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments [1.6153514666902042]
Real organizational work requires managing many concurrent long-horizon tasks with interleaving, dependencies, and reprioritization. We introduce Multi-Horizon Task Environments (MHTEs): a distinct problem class requiring coherent execution across dozens of interleaved tasks. We identify four failure modes that cause baseline CUAs to degrade from 16.7% to 8.7% completion as load scales from 25% to 100%. We present CorpGen, an architecture-agnostic framework addressing these failures via hierarchical planning for multi-horizon goal alignment.
arXiv Detail & Related papers (2026-02-15T16:54:34Z) - Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation [57.65688895630163]
We introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. Our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without forgetting existing environments.
arXiv Detail & Related papers (2026-02-10T23:06:02Z) - Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z) - Endless Terminals: Scaling RL Environments for Terminal Agents [39.60665149203152]
Endless Terminals is a fully autonomous pipeline that procedurally generates terminal-use tasks without human annotation. We train agents using vanilla PPO with binary episode-level rewards and a minimal interaction loop. These improvements transfer to human-curated benchmarks.
arXiv Detail & Related papers (2026-01-23T04:39:55Z) - AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts [35.52607495764441]
Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. We introduce AgencyBench, a benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve.
arXiv Detail & Related papers (2026-01-16T07:22:20Z) - Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem [90.17610617854247]
We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME, an open-source agent grounded by ALE and trained on over one million trajectories.
arXiv Detail & Related papers (2025-12-31T14:03:39Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay among agent quantity, coordination structure, model capability, and task properties. We derive a predictive model using coordination metrics, cross-validated to enable prediction on unseen task domains. Among the effects we identify: (1) a tool-coordination trade-off, where under fixed computational budgets tool-heavy tasks suffer disproportionately from multi-agent overhead; and (2) capability saturation, where coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Efficient Multi-turn RL for GUI Agents via Decoupled Training and Adaptive Data Curation [65.3648667980258]
Vision-language model (VLM) based GUI agents show promise for automating complex tasks, but face significant challenges in applying reinforcement learning (RL). We propose DART, a Decoupled Agentic RL Training framework for GUI agents, which coordinates heterogeneous modules in a highly decoupled manner. On the OSWorld benchmark, DART-GUI-7B achieves a 42.13% task success rate, a 14.61% absolute gain over the base model and 7.34% higher than the open-source SOTA.
arXiv Detail & Related papers (2025-09-28T13:19:20Z) - AWorld: Orchestrating the Training Recipe for Agentic AI [35.94278765364194]
We introduce AWorld, an open-source system engineered for large-scale agent-environment interaction. By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution. We trained a Qwen3-32B-based agent that achieves pass@1 accuracy of 32.23% on the GAIA test set.
arXiv Detail & Related papers (2025-08-28T04:04:30Z) - Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models [33.1538965735133]
Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions. We construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
arXiv Detail & Related papers (2024-08-15T17:23:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.