Measuring Agents in Production
- URL: http://arxiv.org/abs/2512.04123v1
- Date: Tue, 02 Dec 2025 16:45:10 GMT
- Title: Measuring Agents in Production
- Authors: Melissa Z. Pan, Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A Agrawal, Huanzhi Mao, Emma Shen, Sid Pallerla, Liana Patel, Shu Liu, Tianneng Shi, Xiaoyuan Liu, Jared Quincy Davis, Emmanuele Lacavalla, Alessandro Basile, Shuyi Yang, Paul Castro, Daniel Kang, Joseph E. Gonzalez, Koushik Sen, Dawn Song, Ion Stoica, Matei Zaharia, Marquita Ellis
- Abstract summary: We present the first large-scale systematic study of AI agents in production. We find that production agents are typically built using simple, controllable approaches. Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness.
- Score: 133.77818981073457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI agents are actively running in production across diverse industries, yet little is publicly known about which technical approaches enable successful real-world deployments. We present the first large-scale systematic study of AI agents in production, surveying 306 practitioners and conducting 20 in-depth case studies via interviews across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and what the top development challenges are. We find that production agents are typically built using simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness. Despite these challenges, simple yet effective methods already enable agents to deliver impact across diverse industries. Our study documents the current state of practice and bridges the gap between research and deployment by providing researchers visibility into production challenges while offering practitioners proven patterns from successful deployments.
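As a concrete illustration of the "simple, controllable" pattern those numbers describe, below is a minimal sketch of a step-capped agent loop with a human-in-the-loop fallback. It is not code from the paper: `call_model`, `run_tool`, and `escalate_to_human` are hypothetical stand-ins for a prompted off-the-shelf model, a tool executor, and an escalation channel.

```python
# Minimal sketch (not from the paper) of the pattern the survey reports:
# a prompted off-the-shelf model, a hard step budget, and a human handoff.

MAX_STEPS = 10  # 68% of surveyed agents take at most 10 steps

def call_model(transcript: str) -> dict:
    """Hypothetical stand-in for a prompted call to an off-the-shelf model."""
    return {"done": True, "answer": f"(model output for: {transcript[:40]}...)"}

def run_tool(step: dict) -> str:
    """Hypothetical stand-in for executing the tool call the model requested."""
    return f"tool result for {step}"

def escalate_to_human(transcript: list[str]) -> str:
    """Human-in-the-loop fallback once the step budget is exhausted."""
    return "ESCALATED TO HUMAN:\n" + "\n".join(transcript)

def run_agent(task: str) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(MAX_STEPS):
        step = call_model("\n".join(transcript))
        if step.get("done"):
            return step["answer"]
        transcript.append(run_tool(step))
    return escalate_to_human(transcript)

print(run_agent("triage today's support tickets"))
```

The hard step budget and explicit escalation path are what keep the loop controllable and auditable; the model is used purely through prompting, with no weight tuning.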
Related papers
- The Role of AI in Modern Penetration Testing [0.0]
Penetration testing is a cornerstone of cybersecurity, traditionally driven by manual, time-intensive processes. This systematic literature review examines how Artificial Intelligence (AI) is reshaping penetration testing.
arXiv Detail & Related papers (2025-12-13T13:34:31Z)
- AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research. Our suite comes with the first scientific research environment with production-grade search tools. Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z)
- An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications [12.166151903597445]
Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains. Their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. We conduct the first large-scale empirical study of testing practices in the AI agent ecosystem.
arXiv Detail & Related papers (2025-09-23T16:02:09Z)
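One common way such non-determinism is handled in practice, sketched below purely for illustration (it is not from the study), is to assert stable properties of an agent's output over repeated runs rather than compare against a single golden answer.

```python
# Illustrative only: property-based check over repeated runs of a
# non-deterministic agent, instead of an exact-match golden test.
import random

def flaky_agent(query: str) -> str:
    # Hypothetical stand-in for an FM-based agent whose wording varies run to run.
    return random.choice([f"Answer: 42 ({query})", f"The result is 42 for {query}"])

def test_agent_always_contains_answer():
    for _ in range(20):  # sample many runs, since single runs are not reproducible
        output = flaky_agent("meaning of life")
        assert "42" in output  # assert a stable property, not the exact string

test_agent_always_contains_answer()
print("property held across 20 runs")
```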
- LIMI: Less is More for Agency [49.63355240818081]
LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. We show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.
arXiv Detail & Related papers (2025-09-22T10:59:32Z)
- Collaborative Gym: A Framework for Enabling and Evaluating Human-Agent Collaboration [50.657070334404835]
Collaborative Gym is a framework enabling asynchronous, tripartite interaction among agents, humans, and task environments. We instantiate Co-Gym with three representative tasks in both simulated and real-world conditions. Our findings reveal that collaborative agents consistently outperform their fully autonomous counterparts in task performance.
arXiv Detail & Related papers (2024-12-20T09:21:15Z)
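The asynchronous, tripartite structure can be pictured with a small asyncio sketch. This illustrates the idea only; it is not Co-Gym's actual API, and all names are hypothetical.

```python
# Illustrative asyncio sketch of asynchronous, tripartite interaction
# (agent / human / shared task environment); not Co-Gym's actual API.
import asyncio

async def environment(inbox: asyncio.Queue, log: list) -> list:
    # The environment applies actions from either party in arrival order.
    while True:
        actor, action = await inbox.get()
        log.append(f"{actor}: {action}")
        if action == "submit":
            return log

async def agent(inbox: asyncio.Queue) -> None:
    for action in ("search literature", "draft summary"):
        await inbox.put(("agent", action))
        await asyncio.sleep(0.01)  # agent acts on its own clock

async def human(inbox: asyncio.Queue) -> None:
    await asyncio.sleep(0.05)  # the human jumps in whenever they like
    await inbox.put(("human", "edit draft"))
    await inbox.put(("human", "submit"))

async def main() -> None:
    inbox: asyncio.Queue = asyncio.Queue()
    log, *_ = await asyncio.gather(environment(inbox, []), agent(inbox), human(inbox))
    print("\n".join(log))

asyncio.run(main())
```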
- Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance [95.03771007780976]
We tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. First, we collect real-world human activities to generate proactive task predictions. These predictions are labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment.
arXiv Detail & Related papers (2024-10-16T08:24:09Z)
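The last step of that pipeline, training a reward model on accept/reject labels to simulate human judgment, can be sketched as a simple text classifier. The data, features, and model choice below are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch of the labeled-data -> reward-model step; the data,
# features, and classifier are assumptions, not the paper's actual setup.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Proactive task predictions labeled by annotators: 1 = accepted, 0 = rejected.
predictions = [
    "Draft a reply to the meeting invite that just arrived",
    "Order food without being asked",
    "Open the file the user just mentioned",
    "Clear the user's browser history",
]
labels = [1, 0, 1, 0]

vectorizer = HashingVectorizer(n_features=2**12)
reward_model = LogisticRegression().fit(vectorizer.transform(predictions), labels)

# The trained model scores new proactive suggestions as a proxy for human judgment.
proba = reward_model.predict_proba(vectorizer.transform(["Summarize unread emails"]))
print(f"estimated acceptance probability: {proba[0, 1]:.2f}")
```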
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems, with the best model (GPT-4o) solving only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.