OrgAgent: Organize Your Multi-Agent System like a Company
Abstract Overview
This paper introduces OrgAgent, a company-style hierarchical multi-agent system that separates collaboration into governance, execution, and compliance layers. The framework defines corporate-inspired roles (CEO, CTO, COO, Drafter, Reviewer, Specialist, CSO, CCO) and supports multiple execution modes (DIRECT, LIGHT MAS, FULL MAS) and policies (STRICT, BALANCE, NOCAP, AUTO). The authors evaluate hierarchical and flat organizations on MuSiQue, MuSR, and SQuAD 2.0 using GPT-5 mini, GPT-OSS-120B, and Llama 3.1 8B. Results indicate that hierarchical organization generally improves performance over flat and single-agent baselines on MuSiQue and SQuAD 2.0, whereas accuracy results on MuSR are mixed. Hierarchical organization also reduces token consumption relative to flat collaboration in all reported settings.
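To make the vocabulary above concrete, the layers, roles, modes, and policies can be written down as a small configuration schema. The following is only a sketch, not the authors' implementation: the names `Layer`, `Mode`, `Policy`, `OrgConfig`, `ASSUMED_ROLE_LAYER`, and `roles_in` are hypothetical, and the assignment of each role to a layer is an assumption made for illustration, since this summary lists the roles and the layers but does not state which role belongs where.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    GOVERNANCE = "governance"
    EXECUTION = "execution"
    COMPLIANCE = "compliance"


class Mode(Enum):
    DIRECT = "DIRECT"
    LIGHT_MAS = "LIGHT MAS"
    FULL_MAS = "FULL MAS"


class Policy(Enum):
    STRICT = "STRICT"
    BALANCE = "BALANCE"
    NOCAP = "NOCAP"
    AUTO = "AUTO"


# Role names come from the summary. The role-to-layer mapping below is an
# ASSUMPTION for illustration only; the summary does not specify it.
ASSUMED_ROLE_LAYER = {
    "CEO": Layer.GOVERNANCE,
    "CTO": Layer.GOVERNANCE,
    "COO": Layer.GOVERNANCE,
    "Drafter": Layer.EXECUTION,
    "Reviewer": Layer.EXECUTION,
    "Specialist": Layer.EXECUTION,
    "CSO": Layer.COMPLIANCE,
    "CCO": Layer.COMPLIANCE,
}


@dataclass(frozen=True)
class OrgConfig:
    """One experimental setting: an execution mode paired with a policy."""
    mode: Mode
    policy: Policy


def roles_in(layer: Layer) -> list:
    """Return the role names assigned to a layer, sorted alphabetically."""
    return sorted(r for r, l in ASSUMED_ROLE_LAYER.items() if l is layer)


config = OrgConfig(mode=Mode.FULL_MAS, policy=Policy.AUTO)
print(config.mode.value, config.policy.value, roles_in(Layer.COMPLIANCE))
```

Keeping mode and policy as separate enumerations mirrors the way the summary presents them as independent settings; how the actual system combines them is not described here.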
Novelty
The paper treats organizational structure itself as the central variable in multi-agent system design and evaluation, rather than focusing solely on local interaction mechanisms. It proposes a corporate-style hierarchy with explicit governance, execution, and compliance layers, combined with configurable execution modes and policies, and provides the first systematic empirical comparison of flat versus hierarchical MAS on general reasoning tasks.
Results
Hierarchical OrgAgent achieves the strongest results on MuSiQue and SQuAD 2.0 for all three tested models, with reported gains over flat MAS ranging from +18.97% to +123.99% on MuSiQue and from +58.96% to +120.47% on SQuAD 2.0. The hierarchical configuration also uses fewer tokens than flat MAS in every setting, with reductions ranging from 46.38% to 79.31% across all benchmarks and models. On MuSR, however, flat organization outperforms hierarchical coordination for GPT-OSS-120B and Llama 3.1 8B.
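The percentages above are easiest to read as relative changes against the flat MAS baseline. The sketch below shows that arithmetic under the assumption that gains and reductions are computed as percent change relative to flat MAS; the summary does not state the exact formula, the function names are hypothetical, and the input numbers are made-up placeholders rather than values from the paper.

```python
def relative_gain(hier_score: float, flat_score: float) -> float:
    """Percent improvement of the hierarchical score over the flat score."""
    if flat_score == 0:
        raise ValueError("flat_score must be non-zero")
    return 100.0 * (hier_score - flat_score) / flat_score


def token_reduction(hier_tokens: float, flat_tokens: float) -> float:
    """Percent of flat-MAS tokens saved by the hierarchical setting."""
    if flat_tokens == 0:
        raise ValueError("flat_tokens must be non-zero")
    return 100.0 * (flat_tokens - hier_tokens) / flat_tokens


# Placeholder inputs for illustration only (NOT results from the paper).
gain = relative_gain(hier_score=50.0, flat_score=40.0)
saved = token_reduction(hier_tokens=400.0, flat_tokens=1000.0)
print(f"gain: +{gain:.2f}%  tokens saved: {saved:.2f}%")
```

Under this reading, a gain above +100% simply means the hierarchical score is more than double the flat score.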
Key Points
- OrgAgent structures multi-agent reasoning into governance, execution, and compliance layers with distinct corporate-style roles, a skill-based worker pool, and configurable execution modes and policies.
- Hierarchical coordination outperforms flat collaboration on MuSiQue and SQuAD 2.0 across all three models while consistently reducing token usage by 46–79%, though on MuSR flat organization remains better for GPT-OSS-120B and Llama 3.1 8B.
- Coordination behavior analysis reveals model-dependent skill specialization patterns and substantially higher abstention rates (up to 39.78%) on unanswerable SQuAD 2.0 questions under hierarchical policies compared to near-zero abstention in flat and baseline settings.
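An abstention rate such as the one in the last point can be computed as the share of unanswerable questions on which the system declines to answer. The following is a minimal sketch under that assumed definition; the summary does not give the paper's exact metric, the `Record` format and the function name are hypothetical, and the toy records are invented for illustration.

```python
from typing import Optional, Sequence, Tuple

# (is_answerable, prediction); a prediction of None denotes an abstention.
Record = Tuple[bool, Optional[str]]


def abstention_rate(records: Sequence[Record]) -> float:
    """Percent of unanswerable questions on which the system abstained."""
    unanswerable = [pred for answerable, pred in records if not answerable]
    if not unanswerable:
        raise ValueError("no unanswerable questions in records")
    abstained = sum(1 for pred in unanswerable if pred is None)
    return 100.0 * abstained / len(unanswerable)


# Toy records for illustration only (NOT data from the paper).
toy_records = [
    (True, "Paris"),          # answerable, answered
    (False, None),            # unanswerable, abstained
    (False, "1999"),          # unanswerable, answered anyway
    (False, "blue"),          # unanswerable, answered anyway
    (False, "unknown city"),  # unanswerable, answered anyway
]
rate = abstention_rate(toy_records)
print(f"abstention rate on unanswerable questions: {rate:.2f}%")
```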