A Task-Level Evaluation of AI Agents in Open-Source Projects
- URL: http://arxiv.org/abs/2602.02345v1
- Date: Mon, 02 Feb 2026 17:05:19 GMT
- Title: A Task-Level Evaluation of AI Agents in Open-Source Projects
- Authors: Shojibur Rahman, Md Fazle Rabbi, Minhaz Zibran
- Abstract summary: We present a comparative study of five autonomous coding agents using AIDev-pop. We evaluate agents' performance along three task-aware dimensions spanning the PR lifecycle. Our findings inform the selection and improvement of AI agents for their effective integration into collaborative software engineering.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a comparative study of five autonomous coding agents using AIDev-pop, a public dataset containing thousands of AI-generated pull requests (PRs) across popular open-source repositories. We evaluate agents' performance along three task-aware dimensions spanning the PR lifecycle: (1) PR acceptance rate, (2) review discussion volume, and (3) commit message quality. Our quantitative analysis finds that Codex consistently achieves high PR acceptance rates across most task categories, while Copilot's PRs trigger the highest volume of both human and automated review discussions. In contrast, commit-level quality varies independently of acceptance outcomes: Claude and Cursor produce higher proportions of high-quality commit messages across several task types, while Codex exhibits comparatively lower commit quality despite strong integration outcomes. Our findings inform the selection and improvement of AI agents for their effective integration into collaborative software engineering.
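As an illustration of how such task-aware metrics could be computed, the following is a minimal sketch deriving the three PR-lifecycle dimensions from an AIDev-pop-style table. The file name and the column names (agent, task_category, merged, review_comments, commit_msg_quality) are hypothetical placeholders, not the dataset's actual schema.

```python
import pandas as pd

# Illustrative sketch only: the file and column names below are assumed
# placeholders, not the actual AIDev-pop schema.
prs = pd.read_csv("aidev_pop_prs.csv")  # hypothetical export of the dataset

metrics = prs.groupby(["agent", "task_category"]).agg(
    acceptance_rate=("merged", "mean"),          # (1) share of PRs merged
    review_volume=("review_comments", "mean"),   # (2) mean review discussion size
    high_quality_msgs=("commit_msg_quality",     # (3) share of high-quality commit messages
                       lambda q: (q == "high").mean()),
)
print(metrics.sort_values("acceptance_rate", ascending=False))
```

Grouping by both agent and task category mirrors the paper's observation that acceptance and commit-message quality can diverge for the same agent.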
Related papers
- When AI Teammates Meet Code Review: Collaboration Signals Shaping the Integration of Agent-Authored Pull Requests [0.0]
We study integration outcomes, resolution speed, and review-time collaboration signals using the public AIDev dataset. We find that reviewer engagement has the strongest correlation with successful integration, whereas larger change sizes and coordination-disrupting actions are associated with a lower likelihood of merging.
arXiv Detail & Related papers (2026-02-23T02:20:56Z)
- Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study [5.127121704630949]
We analyze 8,106 fix-related PRs authored by five widely used AI coding agents from the AIDev-pop dataset. Our results indicate that test case failures and prior resolution of the same issues by other PRs are the most common causes of non-integration.
arXiv Detail & Related papers (2026-01-29T22:06:58Z)
- AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can use natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z)
- Let's Make Every Pull Request Meaningful: An Empirical Analysis of Developer and Agentic Pull Requests [0.944838645453772]
We conduct a large-scale empirical analysis of 40,214 PRs collected from the AIDev dataset. We extract 64 features across six families and fit statistical regression models to compare PR merge outcomes for human and agentic PRs. Our results show that submitter attributes dominate merge outcomes for both groups, while review-related features exhibit contrasting effects between human and agentic PRs.
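A minimal sketch of the kind of per-group regression such a comparison implies, assuming a hypothetical CSV export and three stand-in features; the paper's actual 64 features and model specification are not reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical sketch: logistic regression of merge outcome on a few
# stand-in features, fit separately for human and agentic PRs.
prs = pd.read_csv("aidev_prs.csv")  # assumed export; not the actual schema

for is_agentic, subset in prs.groupby("is_agentic"):
    model = smf.logit(
        "merged ~ submitter_prior_prs + files_changed + review_comments",
        data=subset,
    ).fit(disp=False)
    label = "agentic" if is_agentic else "human"
    print(label, model.params, sep="\n")
```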
arXiv Detail & Related papers (2026-01-26T18:16:10Z)
- Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub [5.808464460707249]
We conduct a large-scale study of 33k agent-authored PRs made by five coding agents across GitHub. We first quantitatively characterize merged and not-merged PRs along four broad dimensions. Not-merged PRs tend to involve larger code changes, touch more files, and often do not pass the project's CI/CD pipeline validation.
arXiv Detail & Related papers (2026-01-21T17:12:46Z)
- Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct a three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
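As a rough sketch of what parallel orchestration across many tasks can look like, assuming a hypothetical run_agent_on_task callable; the harness's real interface is not described in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Illustrative sketch only: run_agent_on_task and tasks are hypothetical
# stand-ins, not the leaderboard's actual API.
def evaluate_in_parallel(run_agent_on_task, tasks, max_workers=32):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_agent_on_task, t): t for t in tasks}
        for fut in as_completed(futures):
            task = futures[fut]
            try:
                results[task] = fut.result()
            except Exception as exc:  # one failed task should not halt the run
                results[task] = {"error": str(exc)}
    return results
```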
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
- How can we assess human-agent interactions? Case studies in software agent design [52.953425368394306]
We make two major steps towards the rigorous assessment of human-agent interactions. We propose PULSE, a framework for more efficient human-centric evaluation of agent designs. We deploy the framework on a large-scale web platform built around the open-source software agent OpenHands.
arXiv Detail & Related papers (2025-10-10T19:04:28Z)
- AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation [31.02336903452371]
AirQA is a human-annotated comprehensive paper QA dataset in the field of artificial intelligence (AI). With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
arXiv Detail & Related papers (2025-09-21T07:24:17Z)
- Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows [60.04362496037186]
We present the first controlled study of developer interactions with coding agents. We evaluate two leading copilot and agentic coding assistants. Our results show agents can assist developers in ways that surpass copilots.
arXiv Detail & Related papers (2025-07-10T20:12:54Z)
- On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems (those that autonomously plan and act) are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
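Based only on the abstract's description of sampling, evaluation, and feedback, the following is a hedged sketch of an IAD-style loop; generate, evaluate, and extract_feedback are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of an iterative sample-evaluate-feedback loop in the spirit
# of Iterative Agent Decoding (IAD). All three callables are assumptions.
def iterative_agent_decoding(task, generate, evaluate, extract_feedback,
                             n_samples=4, max_iters=3):
    feedback = None
    best, best_score = None, float("-inf")
    for _ in range(max_iters):
        # Sampling: draw several candidate outputs, conditioning on feedback.
        candidates = [generate(task, feedback) for _ in range(n_samples)]
        # Evaluation: score each candidate with a verifier or critic.
        scored = [(evaluate(task, c), c) for c in candidates]
        score, candidate = max(scored, key=lambda sc: sc[0])
        if score > best_score:
            best, best_score = candidate, score
        # Feedback: turn critiques of the round's best candidate into
        # guidance inserted into the next round of sampling.
        feedback = extract_feedback(task, candidate)
    return best
```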
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents [59.825725526176655]
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents. Existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. We introduce MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios.
arXiv Detail & Related papers (2025-03-03T05:18:50Z)
- Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions [11.620351603683496]
GitHub's Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs.
In this study, we examine 18,256 PRs in which parts of the descriptions were crafted by generative AI.
Our findings indicate that Copilot for PRs, though in its infancy, is seeing a marked uptick in adoption.
arXiv Detail & Related papers (2024-02-14T06:20:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.