Related papers: Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests

Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests

URL: http://arxiv.org/abs/2601.03556v1
Date: Wed, 07 Jan 2026 03:52:13 GMT
Title: Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests
Authors: Sabrina Haque, Sarvesh Ingale, Christoph Csallner,
Abstract summary: We present an empirical study of test inclusion in agentic pull requests using the AIDev dataset.<n>Across agents, test-containing PRs are more common over time and tend to be larger and take longer to complete.<n>We also observe variation across agents in both test adoption and the balance between test and production code within test PRs.
Score: 1.2043574473965317
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Testing is a critical practice for ensuring software correctness and long-term maintainability. As agentic coding tools increasingly submit pull requests (PRs), it becomes essential to understand how testing appears in these agent-driven workflows. Using the AIDev dataset, we present an empirical study of test inclusion in agentic pull requests. We examine how often tests are included, when they are introduced during the PR lifecycle and how test-containing PRs differ from non-test PRs in terms of size, turnaround time, and merge outcomes. Across agents, test-containing PRs are more common over time and tend to be larger and take longer to complete, while merge rates remain largely similar. We also observe variation across agents in both test adoption and the balance between test and production code within test PRs. Our findings provide a descriptive view of testing behavior in agentic pull requests and offer empirical grounding for future studies of autonomous software development.

Related papers

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents [20.29427807019999]
Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches.<n>In these, agents often write tests on the fly, a paradigm adopted by many high-ranking agents on the SWE-bench leaderboard.<n>This raises the critical question: whether such tests meaningfully improve issue resolution or merely mimic human testing practices while consuming a substantial interaction budget.
arXiv Detail & Related papers (2026-02-08T10:26:31Z)
Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt.<n>Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling.<n>We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z)
Human-Agent versus Human Pull Requests: A Testing-Focused Characterization and Comparison [0.5794954517255626]
This paper presents an empirical study of 6,582 human-agent PRs (HAPRs) and 3,122 human PRs (HPRs) from the AIDev dataset.<n>We compare HAPRs and HPRs along three dimensions: (i) testing frequency and extent, (ii) types of testing-related changes, and (iii) testing quality, measured by test smells.
arXiv Detail & Related papers (2026-01-29T02:50:02Z)
Automated structural testing of LLM-based agents: methods, framework, and case studies [0.05254956925594667]
LLM-based agents are rapidly being adopted across diverse domains.<n>Current testing approaches focus on acceptance-level evaluation from the user's perspective.<n>We present methods to enable structural testing of LLM-based agents.
arXiv Detail & Related papers (2026-01-25T11:52:30Z)
Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation [20.31612139450269]
Testing pull requests (PRs) is critical to maintaining software quality.<n>Some PR-modified lines remain untested, leaving a "last-mile" regression test gap.<n>We present ChaCo, an LLM-based test augmentation technique that addresses this gap.
arXiv Detail & Related papers (2026-01-16T02:08:16Z)
Security in the Age of AI Teammates: An Empirical Study of Agentic Pull Requests on GitHub [4.409447722044799]
This study aims to characterize how autonomous coding agents contribute to software security in practice.<n>We conduct a large-scale empirical analysis of agent-authored PRs using the AIDev dataset.<n>We then analyze prevalence, acceptance outcomes, and review latency across autonomous agents, programming ecosystems, and types of code changes.
arXiv Detail & Related papers (2026-01-01T21:14:11Z)
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents [58.00130492861884]
TraitBasis is a lightweight, model-agnostic method for systematically stress testing AI agents.<n>TraitBasis learns directions in activation space corresponding to steerable user traits.<n>We observe on average a 2%-30% performance degradation on $tau$-Trait across frontier models.
arXiv Detail & Related papers (2025-10-06T05:03:57Z)
Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents [2.3429263075112288]
We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation.<n>Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities.
arXiv Detail & Related papers (2025-08-24T15:02:13Z)
From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking [48.90371827091671]
AutoExperiment is a benchmark that evaluates AI agents' ability to implement and run machine learning experiments.<n>We evaluate state-of-the-art agents and find that performance degrades rapidly as $n$ increases.<n>Our findings highlight critical challenges in long-horizon code generation, context retrieval, and autonomous experiment execution.
arXiv Detail & Related papers (2025-06-24T15:39:20Z)
TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement.<n>TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z)
Validation of massively-parallel adaptive testing using dynamic control matching [0.0]
Modern businesses often run many A/B/n tests at the same time and in parallel, and package many content variations into the same messages. This paper presents a method for disentangling the causal effects of the various tests under conditions of continuous test adaptation.
arXiv Detail & Related papers (2023-05-02T11:28:12Z)
Sequential Kernelized Independence Testing [77.237958592189]
We design sequential kernelized independence tests inspired by kernelized dependence measures.<n>We demonstrate the power of our approaches on both simulated and real data.
arXiv Detail & Related papers (2022-12-14T18:08:42Z)
A Search-Based Testing Approach for Deep Reinforcement Learning Agents [1.1580916951856255]
We propose a Search-based Testing Approach of Reinforcement Learning Agents (STARLA) to test the policy of a DRL agent. We use machine learning models and a dedicated genetic algorithm to narrow the search towards faulty episodes.
arXiv Detail & Related papers (2022-06-15T20:51:33Z)
Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework [68.96770035057716]
A/B testing is a business strategy to compare a new product with an old one in pharmaceutical, technological, and traditional industries. This paper introduces a reinforcement learning framework for carrying A/B testing in online experiments.
arXiv Detail & Related papers (2020-02-05T10:25:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.