Related papers: Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study

Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study

URL: http://arxiv.org/abs/2602.00164v1
Date: Thu, 29 Jan 2026 22:06:58 GMT
Title: Why Are AI Agent Involved Pull Requests (Fix-Related) Remain Unmerged? An Empirical Study
Authors: Khairul Alam, Saikat Mondal, Banani Roy,
Abstract summary: We analyze 8,106 fix related PRs authored by five widely used AI coding agents from the AIDEV POP dataset.<n>Our results indicate that test case failures and prior resolution of the same issues by other PRs are the most common causes of non integration.
Score: 5.127121704630949
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot) are increasingly used to generate fix-related pull requests (PRs) in real world software repositories. However, their practical effectiveness depends on whether these contributions are accepted and merged by project maintainers. In this paper, we present an empirical study of AI agent involved fix related PRs, examining both their integration outcomes, latency, and the factors that hinder successful merging. We first analyze 8,106 fix related PRs authored by five widely used AI coding agents from the AIDEV POP dataset to quantify the proportions of PRs that are merged, closed without merging, or remain open. We then conduct a manual qualitative analysis of a statistically significant sample of 326 closed but unmerged PRs, spending approximately 100 person hours to construct a structured catalog of 12 failure reasons. Our results indicate that test case failures and prior resolution of the same issues by other PRs are the most common causes of non integration, whereas build or deployment failures are comparatively rare. Overall, our findings expose key limitations of current AI coding agents in real world settings and highlight directions for their further improvement and for more effective human AI collaboration in software maintenance.

Related papers

Let's Make Every Pull Request Meaningful: An Empirical Analysis of Developer and Agentic Pull Requests [0.944838645453772]
We conduct a large-scale empirical analysis of 40,214 PRs collected from the AIDev dataset.<n>We extract 64 features across six families and fit statistical regression models to compare PR merge outcomes for human and agentic PRs.<n>Our results show that submitter attributes dominate merge outcomes for both groups, while review-related features exhibit contrasting effects between human and agentic PRs.
arXiv Detail & Related papers (2026-01-26T18:16:10Z)
Where Do AI Coding Agents Fail? An Empirical Study of Failed Agentic Pull Requests in GitHub [5.808464460707249]
We conduct a large-scale study of 33k agent-authored PRs made by five coding agents across GitHub.<n>We first quantitatively characterize merged and not-merged PRs along four broad dimensions.<n>Not-merged PRs tend to involve larger code changes, touch more files, and often do not pass the project's CI/CD pipeline validation.
arXiv Detail & Related papers (2026-01-21T17:12:46Z)
On Autopilot? An Empirical Study of Human-AI Teaming and Review Practices in Open Source [11.412808537439973]
We investigated project-level guidelines and developers' interactions with AI-assisted pull requests (PRs)<n>We found that over 67.5% of AI-co-authored PRs originate from contributors without prior code ownership.<n>In contrast to human-created PRs where non-owner developers receive the most feedback, AI-co-authored PRs from non-owners receive the least.
arXiv Detail & Related papers (2026-01-20T09:09:53Z)
Early-Stage Prediction of Review Effort in AI-Generated Pull Requests [0.0]
We analyze 33,707 agent-authored PRs from the AIDev dataset across 2,807 repositories.<n>We propose a Circuit Breaker triage model that predicts high-review-effort PRs at creation time.
arXiv Detail & Related papers (2026-01-02T17:18:01Z)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks.<n>We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks.<n>Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
Barbarians at the Gate: How AI is Upending Systems Research [58.95406995634148]
We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery.<n>We term this approach as AI-Driven Research for Systems ( ADRS), which iteratively generates, evaluates, and refines solutions.<n>Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.
arXiv Detail & Related papers (2025-10-07T17:49:24Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub [6.7302091035327285]
Large language models (LLMs) are increasingly being integrated into software development processes.<n>The ability to generate code and submit pull requests with minimal human intervention, through the use of autonomous AI agents, is poised to become a standard practice.<n>We empirically study 567 GitHub pull requests (PRs) generated using Claude Code, an agentic coding tool, across 157 open-source projects.
arXiv Detail & Related papers (2025-09-18T08:48:32Z)
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving [62.71545696485824]
We introduce AGENT KB, a universal memory infrastructure enabling seamless experience sharing across heterogeneous agent frameworks without retraining.<n>AGENT KB aggregates trajectories into a structured knowledge base and serves lightweight APIs.<n>We validate AGENT across major frameworks on GAIA, Humanity's Last Exam, GPQA, and SWE-bench.
arXiv Detail & Related papers (2025-07-08T17:59:22Z)
When Disagreements Elicit Robustness: Investigating Self-Repair Capabilities under LLM Multi-Agent Disagreements [56.29265568399648]
We argue that disagreements prevent premature consensus and expand the explored solution space.<n>Disagreements on task-critical steps can derail collaboration depending on the topology of solution paths.
arXiv Detail & Related papers (2025-02-21T02:24:43Z)
Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
Key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL) This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.