From Correctness to Collaboration: Toward a Human-Centered Framework for Evaluating AI Agent Behavior in Software Engineering
- URL: http://arxiv.org/abs/2512.23844v1
- Date: Mon, 29 Dec 2025 20:18:57 GMT
- Title: From Correctness to Collaboration: Toward a Human-Centered Framework for Evaluating AI Agent Behavior in Software Engineering
- Authors: Tao Dong, Harini Sampath, Ja Young Lee, Sherry Y. Shi, Andrew Macvean
- Abstract summary: Current benchmarks, focused on code correctness, fail to capture the nuanced, interactive behaviors essential for successful human-AI partnership. We present a foundational taxonomy of desirable agent behaviors for enterprise software engineering. We also introduce the Context-Adaptive Behavior (CAB) Framework.
- Score: 7.402388519535592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) evolve from code generators into collaborative partners for software engineers, our evaluation methods are lagging behind. Current benchmarks, focused on code correctness, fail to capture the nuanced, interactive behaviors essential for successful human-AI partnership. To bridge this evaluation gap, this paper makes two core contributions. First, we present a foundational taxonomy of desirable agent behaviors for enterprise software engineering, derived from an analysis of 91 sets of user-defined agent rules. This taxonomy defines four key expectations of agent behavior: Adhering to Standards and Processes, Ensuring Code Quality and Reliability, Solving Problems Effectively, and Collaborating with the User. Second, recognizing that these expectations are not static, we introduce the Context-Adaptive Behavior (CAB) Framework. This emerging framework reveals how behavioral expectations shift along two empirically derived axes: the Time Horizon (from immediate needs to future ideals), established through interviews with 15 expert engineers, and the Type of Work (for example, from enterprise production to rapid prototyping), identified through a prompt analysis of a prototyping agent. Together, these contributions offer a human-centered foundation for designing and evaluating the next generation of AI agents, moving the field's focus from the correctness of generated code toward the dynamics of true collaborative intelligence.
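The taxonomy and the CAB axes lend themselves to a simple data model. Below is a minimal sketch in Python: the four behavior categories and the two axes (Time Horizon, Type of Work) are taken from the abstract, while every class name, field, and weight is an illustrative assumption, not the authors' actual framework code.

```python
# A minimal sketch of the paper's taxonomy and CAB axes as Python data
# structures. The four categories and two axes come from the abstract;
# the class/field names and the weight field are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum


class BehaviorCategory(Enum):
    ADHERE_TO_STANDARDS = "Adhering to Standards and Processes"
    ENSURE_CODE_QUALITY = "Ensuring Code Quality and Reliability"
    SOLVE_PROBLEMS = "Solving Problems Effectively"
    COLLABORATE = "Collaborating with the User"


class TimeHorizon(Enum):  # axis 1: immediate needs -> future ideals
    IMMEDIATE = "immediate needs"
    FUTURE = "future ideals"


class TypeOfWork(Enum):  # axis 2: e.g. enterprise production -> prototyping
    ENTERPRISE_PRODUCTION = "enterprise production"
    RAPID_PROTOTYPING = "rapid prototyping"


@dataclass(frozen=True)
class BehavioralExpectation:
    """One cell of the CAB grid: how a category is weighted in a context."""
    category: BehaviorCategory
    horizon: TimeHorizon
    work: TypeOfWork
    weight: float  # hypothetical relative importance in this context


# Example: strict standards adherence matters most for immediate,
# production-grade work; it may be relaxed during rapid prototyping.
production_now = BehavioralExpectation(
    BehaviorCategory.ADHERE_TO_STANDARDS,
    TimeHorizon.IMMEDIATE,
    TypeOfWork.ENTERPRISE_PRODUCTION,
    weight=1.0,
)
print(production_now)
```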
Related papers
- ClarEval: A Benchmark for Evaluating Clarification Skills of Code Agents under Ambiguous Instructions [19.875754116636436]
We introduce ClarEval, a framework designed to assess an agent's "Collaborative Quotient" by simulating the inherent ambiguity of human communication. To quantify this capability, we propose a metric suite led by Average Turns to Clarify (ATC) and Key Question Coverage (KQC). Our experiments on eleven state-of-the-art agents reveal a stark reality: while models like GPT-5-Coder excel at coding, they often lack the strategic communication skills required for efficient partnership.
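The two headline metrics are straightforward to state. The sketch below gives one plausible Python reading of each; only the metric names (ATC, KQC) come from the abstract, so the definitions, signatures, and averaging scheme are assumptions rather than the benchmark's published code.

```python
# A hedged sketch of how ClarEval's two headline metrics might be computed.
# Only the names ATC and KQC come from the abstract; the exact definitions
# below are plausible readings, not the benchmark's actual implementation.
from typing import Iterable


def average_turns_to_clarify(turn_counts: Iterable[int]) -> float:
    """ATC: mean number of dialogue turns an agent needs before its grasp
    of an ambiguous task is judged sufficient (lower is better)."""
    counts = list(turn_counts)
    return sum(counts) / len(counts)


def key_question_coverage(asked: set[str], key_questions: set[str]) -> float:
    """KQC: fraction of annotator-defined key clarification questions the
    agent actually asked during the episode (higher is better)."""
    return len(asked & key_questions) / len(key_questions)


# Example: an agent clarifies in 2, 3, and 1 turns across three tasks,
# and asks 2 of the 3 key questions on one of them.
print(average_turns_to_clarify([2, 3, 1]))                      # 2.0
print(key_question_coverage({"q1", "q2"}, {"q1", "q2", "q3"}))  # 0.666...
```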
arXiv Detail & Related papers (2026-02-27T01:10:27Z) - EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection. However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight. We introduce EmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z) - AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z) - LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language model (LLM) agents in realistic, long-context software engineering. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations. The framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
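For a sense of scale, that harness can be approximated by a small tool registry plus a context-length sweep. In the sketch below, only the three tool families (file operations, search, code analysis) and the 10K and 1M endpoints come from the abstract; the individual tool names, the intermediate context lengths, and the loop are illustrative assumptions, not LoCoBench-Agent's API.

```python
# A hypothetical tool registry and context sweep in the spirit of the
# LoCoBench-Agent abstract. Tool names and intermediate budgets are invented.
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {
    "read_file": lambda path: f"<contents of {path}>",       # file operations
    "write_file": lambda path, text: "ok",                   # file operations
    "search": lambda query: f"<matches for {query}>",        # search
    "analyze_code": lambda path: f"<analysis of {path}>",    # code analysis
}

# Endpoints (10K, 1M) per the abstract; the intermediate steps are made up.
CONTEXT_LENGTHS = [10_000, 100_000, 500_000, 1_000_000]

for budget in CONTEXT_LENGTHS:
    # A real harness would run each scenario with this token budget and the
    # tool set above; here we only print the sweep to keep the sketch runnable.
    print(f"evaluating with context budget = {budget:,} tokens")
```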
arXiv Detail & Related papers (2025-11-17T23:57:24Z) - Agentsway -- Software Development Methodology for AI Agents-based Teams [4.226647687395254]
"Agentsway" is a novel software development framework designed for ecosystems where AI agents operate as first-class collaborators.<n>The framework defines distinct roles for planning, prompting, coding, testing, and fine-tuning agents.<n>Agentsway represents a foundational step toward the next generation of AI-native, self-improving software development methodologies.
arXiv Detail & Related papers (2025-10-26T11:58:42Z) - A Survey of Vibe Coding with Large Language Models [93.88284590533242]
"Vibe Coding" is a development methodology where developers validate AI-generated implementations through outcome observation.<n>Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored.<n>This survey provides the first comprehensive and systematic review of Vibe Coding with large language models.
arXiv Detail & Related papers (2025-10-14T11:26:56Z) - How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z) - Agentic Software Engineering: Foundational Pillars and a Research Roadmap [15.059942573311481]
Agentic Software Engineering (SE 3.0) represents a new era where intelligent agents are tasked with achieving complex, goal-oriented SE objectives. This paper presents the Structured Agentic Software Engineering (SASE) vision, outlining several of the foundational pillars for the future of SE.
arXiv Detail & Related papers (2025-09-07T21:40:10Z) - AI Agentic Programming: A Survey of Techniques, Challenges, and Opportunities [8.086360127362815]
Large language model (LLM)-based coding agents autonomously plan, execute, and interact with tools such as compilers, debuggers, and version control systems. Unlike conventional code generation, these agents decompose goals, coordinate multi-step processes, and adapt based on feedback, reshaping software development practices.
arXiv Detail & Related papers (2025-08-15T00:14:31Z) - Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows [60.04362496037186]
We present the first controlled study of developer interactions with coding agents. We evaluate two leading copilot and agentic coding assistants. Our results show agents can assist developers in ways that surpass copilots.
arXiv Detail & Related papers (2025-07-10T20:12:54Z) - Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study [15.97770416681533]
Software engineering agents (SWE agents) operate autonomously by interpreting user input and responding to environmental feedback. We present the first systematic study of SWE agent behavior through the lens of execution traces.
arXiv Detail & Related papers (2025-06-10T00:41:54Z) - Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI [0.36868085124383626]
This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. Vibe coding emphasizes intuitive, human-in-the-loop development through prompt-based, conversational interaction. Agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention.
arXiv Detail & Related papers (2025-05-26T03:00:21Z) - ProAgent: Building Proactive Cooperative Agents with Large Language Models [89.53040828210945]
ProAgent is a novel framework that harnesses large language models to create proactive agents.
ProAgent can analyze the present state, and infer the intentions of teammates from observations.
ProAgent exhibits a high degree of modularity and interpretability, making it easy to integrate into various coordination scenarios.
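The loop this abstract describes (observe, infer teammate intent, act proactively) can be summarized in a short control-flow sketch. Everything below, including the function names and the example scenario, is a hypothetical rendering of the abstract's description, not ProAgent's published code.

```python
# A hypothetical observe -> infer -> act loop in the spirit of the ProAgent
# abstract. All names, types, and the example scenario are illustrative.
from dataclasses import dataclass


@dataclass
class Observation:
    state: str            # the agent's view of the current environment
    teammate_action: str  # what the teammate was last seen doing


def infer_intention(obs: Observation) -> str:
    """Guess the teammate's goal from their last action (a stub; the paper
    delegates this inference to a large language model)."""
    return f"goal implied by '{obs.teammate_action}'"


def choose_action(state: str, teammate_goal: str) -> str:
    """Pick a complementary action instead of waiting for instructions."""
    return f"assist with {teammate_goal} given {state}"


# One step of the proactive loop, on an invented cooperative-cooking state.
obs = Observation(state="kitchen, onion chopped", teammate_action="fetch pot")
print(choose_action(obs.state, infer_intention(obs)))
```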
arXiv Detail & Related papers (2023-08-22T10:36:56Z)