CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments
- URL: http://arxiv.org/abs/2602.14229v1
- Date: Sun, 15 Feb 2026 16:54:34 GMT
- Title: CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments
- Authors: Abubakarr Jaye, Nigel Boachie Kumankumah, Chidera Biringa, Anjel Shaileshbhai Patel, Sulaiman Vesal, Dayquan Julienne, Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, Mauricio Velazco, Tianwei Chen,
- Abstract summary: Real organizational work requires managing many concurrent long-horizon tasks with interleaving, dependencies, and reprioritization.<n>We introduce Multi-Horizon Task Environments (MHTEs): a distinct problem class requiring coherent execution across dozens of interleaved tasks.<n>We identify four failure modes that cause baseline CUAs to degrade from 16.7% to 8.7% completion as load scales 25% to 100%.<n>We present CorpGen, an architecture-agnostic framework addressing these failures via hierarchical planning for multi-horizon goal alignment.
- Score: 1.6153514666902042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-horizon reasoning is a key challenge for autonomous agents, yet existing benchmarks evaluate agents on single tasks in isolation. Real organizational work requires managing many concurrent long-horizon tasks with interleaving, dependencies, and reprioritization. We introduce Multi-Horizon Task Environments (MHTEs): a distinct problem class requiring coherent execution across dozens of interleaved tasks (45+, 500-1500+ steps) within persistent execution contexts spanning hours. We identify four failure modes that cause baseline CUAs to degrade from 16.7% to 8.7% completion as load scales 25% to 100%, a pattern consistent across three independent implementations. These failure modes are context saturation (O(N) vs O(1) growth), memory interference, dependency complexity (DAGs vs. chains), and reprioritization overhead. We present CorpGen, an architecture-agnostic framework addressing these failures via hierarchical planning for multi-horizon goal alignment, sub-agent isolation preventing cross-task contamination, tiered memory (working, structured, semantic), and adaptive summarization. CorpGen simulates corporate environments through digital employees with persistent identities and realistic schedules. Across three CUA backends (UFO2, OpenAI CUA, hierarchical) on OSWorld Office, CorpGen achieves up to 3.5x improvement over baselines (15.2% vs 4.3%) with stable performance under increasing load, confirming that gains stem from architectural mechanisms rather than specific CUA implementations. Ablation studies show experiential learning provides the largest gains.
Related papers
- MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present textbfMagicAgent, a series of foundation models specifically designed for generalized agent planning.<n>We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks.<n>We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z) - QP-OneModel: A Unified Generative LLM for Multi-Task Query Understanding in Xiaohongshu Search [19.376785819604923]
QP-OneModel bridges user intent and content supply in SNS search engines.<n>It generates intent descriptions as a novel high-fidelity semantic signal.<n>It also exhibits superior generalization, surpassing a 32B model by 7.60% accuracy on unseen tasks.
arXiv Detail & Related papers (2026-02-10T15:38:17Z) - daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently [35.39097522391409]
Large Language Models (LLMs) excel at short-term tasks, scaling them to long-horizon agentic synthesis remains challenging.<n>We propose daVinci-Agency, which systematically mines structured supervision from chain-of-PRs.<n>DaVinci-Agency's PR-grounded structure inherently preserves the causal dependencies and iterative refinements essential for teaching persistent goal-cycle behavior.
arXiv Detail & Related papers (2026-02-02T13:23:39Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties.<n>We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains.<n>We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - SCUBA: Salesforce Computer Use Benchmark [63.66753028386581]
SCUBA is a benchmark designed to evaluate computer-use agents on customer relationship management ( CRM) within the Salesforce platform.<n> SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas, platform administrators, sales representatives, and service agents.<n>We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings.
arXiv Detail & Related papers (2025-09-30T16:48:49Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.<n>We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.<n>Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need [35.88121318813734]
Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains.<n>We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations.<n>Experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains.
arXiv Detail & Related papers (2025-06-18T13:24:04Z) - Exposing and Addressing Cross-Task Inconsistency in Unified
Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
arXiv Detail & Related papers (2023-03-28T16:57:12Z) - Dynamic Multi-Robot Task Allocation under Uncertainty and Temporal
Constraints [52.58352707495122]
We present a multi-robot allocation algorithm that decouples the key computational challenges of sequential decision-making under uncertainty and multi-agent coordination.
We validate our results over a wide range of simulations on two distinct domains: multi-arm conveyor belt pick-and-place and multi-drone delivery dispatch in a city.
arXiv Detail & Related papers (2020-05-27T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.