Related papers: EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents

URL: http://arxiv.org/abs/2601.17722v1
Date: Sun, 25 Jan 2026 06:58:15 GMT
Title: EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents
Authors: Ying Mo, Yu Bai, Dapeng Sun, Yuqian Shi, Yukai Miao, Li Chen, Dan Li
Abstract summary: EntWorld is a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains. We propose a schema-grounded task generation framework that directly reverse-engineers business logic from underlying database schemas. We show that state-of-the-art models achieve a 47.61% success rate on EntWorld, substantially lower than human performance.
Score: 12.7922877987936
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled agents to operate in open-ended web and operating system environments. However, existing benchmarks predominantly target consumer-oriented scenarios (e.g., e-commerce and travel booking), failing to capture the complexity and rigor of professional enterprise workflows. Enterprise systems pose distinct challenges, including high-density user interfaces, strict business logic constraints, and a strong reliance on precise, state-consistent information retrieval, settings in which current generalist agents often struggle. To address this gap, we introduce EntWorld, a large-scale benchmark consisting of 1,756 tasks across six representative enterprise domains, including customer relationship management (CRM), information technology infrastructure library (ITIL), and enterprise resource planning (ERP) systems. Unlike previous datasets that depend on fragile execution traces or extensive manual annotation, EntWorld adopts a schema-grounded task generation framework that directly reverse-engineers business logic from underlying database schemas, enabling the synthesis of realistic, long-horizon workflows. Moreover, we propose a SQL-based deterministic verification mechanism in building datasets that replaces ambiguous visual matching with rigorous state-transition validation. Experimental results demonstrate that state-of-the-art models (e.g., GPT-4.1) achieve a 47.61% success rate on EntWorld, substantially lower than human performance, highlighting a pronounced enterprise gap in current agentic capabilities and the necessity of developing domain-specific agents. We release EntWorld as a rigorous testbed to facilitate the development and evaluation of the next generation of enterprise-ready digital agents.
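
The abstract's idea of SQL-based state-transition validation can be illustrated with a minimal sketch. This is not EntWorld's implementation: the `tickets` table, the `close_ticket` action, and the `verify_state_transition` helper are all hypothetical names chosen for illustration, and an in-memory SQLite database stands in for an enterprise backend. The check passes only if a query returns the expected rows both before and after the agent's action.

```python
import sqlite3


def verify_state_transition(conn, check_sql, expected_before, expected_after, action):
    """Run check_sql before and after `action`; pass only if both snapshots match."""
    before = conn.execute(check_sql).fetchall()
    action(conn)  # stands in for whatever the agent did through the GUI
    after = conn.execute(check_sql).fetchall()
    return before == expected_before and after == expected_after


def make_db():
    # Hypothetical schema: a single ticket that starts in the 'open' state.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("INSERT INTO tickets (id, status) VALUES (1, 'open')")
    return conn


def close_ticket(conn):
    # Hypothetical agent action: close ticket 1.
    conn.execute("UPDATE tickets SET status = 'closed' WHERE id = 1")


def do_nothing(conn):
    # An agent that clicked around without changing the database state.
    pass


CHECK = "SELECT status FROM tickets WHERE id = 1"
succeeded = verify_state_transition(make_db(), CHECK, [("open",)], [("closed",)], close_ticket)
failed = verify_state_transition(make_db(), CHECK, [("open",)], [("closed",)], do_nothing)
print(succeeded, failed)
```

Because the verdict depends only on database rows, it does not change with how the interface happens to render, which is the contrast with visual matching that the abstract draws.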

Related papers

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents [25.60249598832918]
FT-Dojo is an interactive environment comprising 13 tasks across 5 domains. We develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback.
arXiv Detail & Related papers (2026-03-02T10:37:11Z)
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
LLM and Agent-Driven Data Analysis: A Systematic Approach for Enterprise Applications and System-level Deployment [17.572976426351318]
Generative AI and Agent technologies are transforming enterprise data management and analytics. Traditional database applications and system deployment are fundamentally impacted by AI-driven tools. Data security and compliance are top priorities for organizations adopting AI technologies.
arXiv Detail & Related papers (2025-11-21T07:16:31Z)
CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories [15.512057716487517]
We propose CRMWeaver, a novel approach that enhances business agents in complex settings. We employ a synthetic data generation and RL-based paradigm during training, which significantly improves the model's ability to handle complex data. We validate the efficacy of our approach on the CRMArena-Pro dataset, underscoring its practical value for real-world applications.
arXiv Detail & Related papers (2025-10-29T09:47:40Z)
Affordance Representation and Recognition for Autonomous Agents [64.39018305018904]
This paper introduces a pattern language for world modeling from structured data. The DOM Transduction Pattern addresses the challenge of web page complexity. The Hypermedia Affordances Recognition Pattern enables the agent to dynamically enrich its world model.
arXiv Detail & Related papers (2025-10-28T14:27:28Z)
A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [66.1526688475023]
"Data agent" currently suffers from terminological ambiguity and inconsistent adoption. This survey introduces the first systematic hierarchical taxonomy for data agents. We conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
arXiv Detail & Related papers (2025-10-27T17:54:07Z)
Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics [75.4712507893024]
Enterprise Deep Research (EDR) is a multi-agent system that integrates a Master Planning Agent for adaptive query decomposition. Four specialized search agents (General, Academic, GitHub, LinkedIn) and a visualization agent for data-driven insights are also included. EDR reflects research direction with optional human-in-the-loop steering guidance.
arXiv Detail & Related papers (2025-10-20T17:55:11Z)
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.00932417210477]
We introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. We find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations.
arXiv Detail & Related papers (2025-08-20T13:28:58Z)
Structured Agentic Workflows for Financial Time-Series Modeling with LLMs and Reflective Feedback [16.04516547661581]
Time-series data is central to decision-making in financial markets, yet building high-performing, interpretable, and auditable models remains a major challenge. TSAgent is a modular agentic framework designed to automate and enhance time-series modeling for financial applications.
arXiv Detail & Related papers (2025-08-19T15:14:49Z)
OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
FinRobot: Generative Business Process AI Agents for Enterprise Resource Planning in Finance [6.494553545846438]
We present the first AI-native framework for ERP systems, introducing a novel architecture of Generative Business Process AI Agents. The proposed system integrates generative AI with business process modeling and multi-agent orchestration, enabling end-to-end automation. We show that GBPAs achieve up to 40% reduction in processing time, 94% drop in error rate, and improved regulatory compliance.
arXiv Detail & Related papers (2025-06-02T08:22:28Z)
Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI [11.859180018313147]
We propose a 'blueprint architecture' for compound AI systems for orchestrating agents and data for enterprise applications. Existing proprietary models and APIs in the enterprise are mapped to 'agents', defined in an 'agent registry'. Agents can utilize proprietary data through a 'data registry' that similarly registers enterprise data of various modalities.
arXiv Detail & Related papers (2025-04-10T22:19:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.