How Well Does Agent Development Reflect Real-World Work?
- URL: http://arxiv.org/abs/2603.01203v1
- Date: Sun, 01 Mar 2026 17:55:49 GMT
- Title: How Well Does Agent Development Reflect Real-World Work?
- Authors: Zora Zhiruo Wang, Sanidhya Vijayvargiya, Aspen Chen, Hanmo Zhang, Venu Arvind Arangarajan, Jett Chen, Valerie Chen, Diyi Yang, Daniel Fried, Graham Neubig
- Abstract summary: We study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We reveal substantial mismatches between agent development, which tends to be programming-centric, and the categories in which human labor and economic value are concentrated.
- Score: 89.17217057358285
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In this work, we systematically study the relationship between agent development efforts and the distribution of real-world human work by mapping benchmark instances to work domains and skills. We first analyze 43 benchmarks and 72,342 tasks, measuring their alignment with human employment and capital allocation across all 1,016 real-world occupations in the U.S. labor market. We reveal substantial mismatches between agent development, which tends to be programming-centric, and the categories in which human labor and economic value are concentrated. Within the work areas that agents currently target, we further characterize agent utility by measuring autonomy levels, providing practical guidance for agent interaction strategies across work scenarios. Building on these findings, we propose three measurable principles for designing benchmarks that better capture socially important and technically challenging forms of work: coverage, realism, and granular evaluation.
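The abstract's core measurement, comparing where benchmark tasks fall against where human employment is concentrated, can be illustrated with a minimal sketch. This is not the paper's released code; the occupation labels, function names, and all numbers below are illustrative assumptions, and the gap is computed here as a simple total variation distance between the two distributions.

```python
# Hypothetical sketch: compare a benchmark's task distribution over occupation
# categories with the labor market's employment distribution.
# Assumes tasks have already been labeled with an occupation category
# (e.g., an O*NET-style group); labels and shares below are made up.
from collections import Counter

def benchmark_share(task_occupations):
    """Fraction of benchmark tasks falling into each occupation category."""
    counts = Counter(task_occupations)
    total = sum(counts.values())
    return {occ: n / total for occ, n in counts.items()}

def alignment_gap(task_occupations, employment_share):
    """Total variation distance between the benchmark task distribution and
    the employment distribution (0 = perfect match, 1 = fully disjoint)."""
    bench = benchmark_share(task_occupations)
    categories = set(bench) | set(employment_share)
    return 0.5 * sum(abs(bench.get(c, 0.0) - employment_share.get(c, 0.0))
                     for c in categories)

# Toy illustration with invented numbers:
tasks = ["software_dev"] * 80 + ["office_admin"] * 15 + ["healthcare"] * 5
labor = {"software_dev": 0.02, "office_admin": 0.12,
         "healthcare": 0.10, "other": 0.76}
print(alignment_gap(tasks, labor))  # a large gap reflects a programming-centric skew
```

A large gap under this kind of measure is one way to make the paper's "programming-centric mismatch" claim concrete; the paper's actual alignment metrics may differ.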
Related papers
- Agentic Reasoning for Large Language Models [122.81018455095999]
Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. Large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, but struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction.
arXiv Detail & Related papers (2026-01-18T18:58:23Z) - Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia [100.74015791021044]
Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction. Existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains.
arXiv Detail & Related papers (2025-12-03T00:11:05Z) - Benchmarking LLM Agents for Wealth-Management Workflows [0.0]
This dissertation extends TheAgentCompany with a finance-focused environment. It investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically.
arXiv Detail & Related papers (2025-12-01T21:56:21Z) - UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI [2.0619484032730813]
UpBench is a benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback.
arXiv Detail & Related papers (2025-11-15T17:39:37Z) - Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents [1.0305173936249623]
This white paper proposes a novel framework of eleven outcome-based, task-agnostic performance metrics for AI agents. We introduce metrics such as Goal Completion Rate (GCR), Autonomy Index (AIx), Multi-Step Task Resilience (MTR), and Business Impact Efficiency (BIE). Our results reveal significant performance trade-offs between different agent designs, highlighting the Hybrid Agent as the most consistently high-performing model.
arXiv Detail & Related papers (2025-11-11T13:40:46Z) - A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks [14.762911285395047]
We evaluate seven general-purpose agent frameworks across three representative code-centric tasks. Our findings reveal distinct capability patterns and trade-offs among the evaluated frameworks. For overhead, software development incurs the highest monetary cost, while GPTswarm remains the most cost-efficient.
arXiv Detail & Related papers (2025-11-02T09:46:59Z) - How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations [112.57167042285437]
We study how agents do human work by presenting the first direct comparison of human and agent workers. We find that agents deliver results 88.3% faster and cost 90.4-96.2% less than humans.
arXiv Detail & Related papers (2025-10-26T18:10:22Z) - Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration [50.657070334404835]
Collaborative Gym is a framework enabling asynchronous, tripartite interaction among agents, humans, and task environments. We instantiate Co-Gym with three representative tasks in both simulated and real-world conditions. Our findings reveal that collaborative agents consistently outperform their fully autonomous counterparts in task performance.
arXiv Detail & Related papers (2024-12-20T09:21:15Z) - CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting, and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following before deployment in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z)