Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
- URL: http://arxiv.org/abs/2602.08915v1
- Date: Mon, 09 Feb 2026 17:14:46 GMT
- Title: Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
- Authors: Giovanni Pinna, Jingzhi Gong, David Williams, Federica Sarro,
- Abstract summary: This paper compares five popular AI-powered coding assistants (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code). Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks). Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates.
- Score: 4.424336158797069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).
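To make the reported analysis concrete, below is a minimal sketch, assuming a table of PR records with hypothetical columns "agent", "task_type", "merged", and "week" (not the actual AIDev schema, and not the authors' code): it computes acceptance rates stratified by task type, runs a per-stratum chi-square test of agent versus acceptance, and fits a weekly trend line of the kind behind the +0.77% per week figure.

```python
# Minimal sketch of a task-stratified PR-acceptance analysis.
# Column names and the toy records below are assumptions for illustration;
# the real AIDev dataset schema may differ.
import pandas as pd
from scipy.stats import chi2_contingency, linregress

prs = pd.DataFrame({
    "agent": ["Codex", "Claude Code", "Cursor", "Devin"] * 3,
    "task_type": ["docs", "feature", "fix"] * 4,
    "merged": [1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1],
    "week": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
})

# Acceptance rate stratified by task type (e.g. documentation vs. features).
print(prs.groupby("task_type")["merged"].mean())

# Per-task-type chi-square test: within each stratum, does acceptance
# depend on which agent authored the PR?
for task, group in prs.groupby("task_type"):
    table = pd.crosstab(group["agent"], group["merged"])
    if table.shape[0] > 1 and table.shape[1] > 1:
        chi2, p, dof, _ = chi2_contingency(table)
        print(f"{task}: chi2={chi2:.2f}, p={p:.3f}")

# Weekly acceptance-rate trend for one agent; the slope, in percentage
# points per week, is the kind of quantity reported for Devin.
devin_weekly = prs[prs["agent"] == "Devin"].groupby("week")["merged"].mean()
slope, intercept, r, p, se = linregress(devin_weekly.index, devin_weekly.values)
print(f"Trend: {slope * 100:.2f} percentage points per week (p={p:.3f})")
```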
Related papers
- AgentAssay: Token-Efficient Regression Testing for Non-Deterministic AI Agent Workflows [0.0]
AgentAssay is the first token-efficient framework for regression testing non-deterministic AI agents. It achieves 78-100% cost reduction while maintaining rigorous statistical guarantees.
arXiv Detail & Related papers (2026-03-03T04:59:25Z)
- Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework [0.26508608365976566]
In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion. Self-critique succeeds on code-reviewable logic errors but fails completely on external service integration. RAG achieves the highest completion across all failure types with superior efficiency.
arXiv Detail & Related papers (2026-02-02T23:08:03Z)
- AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios [49.90735676070039]
The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks. We propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks.
arXiv Detail & Related papers (2026-01-28T13:49:18Z)
- Code Change Characteristics and Description Alignment: A Comparative Study of Agentic versus Human Pull Requests [0.0]
We analyze 33,596 agent-generated PRs and 6,618 human PRs to compare code-change characteristics and message quality. Agents generate stronger commit-level messages but lag humans at PR-level summarization. These findings highlight a gap between agents' micro-level precision and macro-level communication.
arXiv Detail & Related papers (2026-01-24T23:33:07Z)
- AI, Metacognition, and the Verification Bottleneck: A Three-Wave Longitudinal Study of Human Problem-Solving [0.0]
This pilot study tracked how generative AI reshapes problem-solving over six months in an academic setting. Results generalize primarily to early-adopter, academically affiliated populations.
arXiv Detail & Related papers (2026-01-21T15:49:04Z)
- Multi-Agent LLM Committees for Autonomous Software Beta Testing [0.0]
The framework combines model diversity, persona-driven behavioral variation, and visual user interface understanding. Vision-enabled agents successfully identify user interface elements, with navigation and reporting achieving 100 percent success. The framework enables reproducible research and practical deployment of LLM-based software testing in CI/CD pipelines.
arXiv Detail & Related papers (2025-12-21T02:06:53Z)
- Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model using coordination metrics (cross-validated R^2), enabling prediction on unseen task domains. We identify three effects, including (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
- Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation [87.47155146067962]
We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of tasks. We conduct a three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
arXiv Detail & Related papers (2025-10-13T22:22:28Z)
- Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits. We observe on average a 2%-30% performance degradation on τ-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
- Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
- On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub [6.7302091035327285]
Large language models (LLMs) are increasingly being integrated into software development processes. The ability to generate code and submit pull requests with minimal human intervention, through the use of autonomous AI agents, is poised to become a standard practice. We empirically study 567 GitHub pull requests (PRs) generated using Claude Code, an agentic coding tool, across 157 open-source projects.
arXiv Detail & Related papers (2025-09-18T08:48:32Z)
- OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.