Human-Agent versus Human Pull Requests: A Testing-Focused Characterization and Comparison
- URL: http://arxiv.org/abs/2601.21194v1
- Date: Thu, 29 Jan 2026 02:50:02 GMT
- Title: Human-Agent versus Human Pull Requests: A Testing-Focused Characterization and Comparison
- Authors: Roberto Milanese, Francesco Salzano, Angelica Spina, Antonio Vitale, Remo Pareschi, Fausto Fasano, Mattia Fazzini
- Abstract summary: This paper presents an empirical study of 6,582 human-agent PRs (HAPRs) and 3,122 human PRs (HPRs) from the AIDev dataset. We compare HAPRs and HPRs along three dimensions: (i) testing frequency and extent, (ii) types of testing-related changes, and (iii) testing quality, measured by test smells.
- Score: 0.5794954517255626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-based coding agents are increasingly integrated into software development workflows, collaborating with developers to create pull requests (PRs). Despite their growing adoption, the role of human-agent collaboration in software testing remains poorly understood. This paper presents an empirical study of 6,582 human-agent PRs (HAPRs) and 3,122 human PRs (HPRs) from the AIDev dataset. We compare HAPRs and HPRs along three dimensions: (i) testing frequency and extent, (ii) types of testing-related changes (code-and-test co-evolution vs. test-focused), and (iii) testing quality, measured by test smells. Our findings reveal that, although the likelihood of including tests is comparable (42.9% for HAPRs vs. 40.0% for HPRs), HAPRs exhibit a larger extent of testing, nearly doubling the test-to-source line ratio found in HPRs. While test-focused task distributions are comparable, HAPRs are more likely to add new tests during co-evolution (OR=1.79), whereas HPRs prioritize modifying existing tests. Finally, although some test smell categories differ statistically, negligible effect sizes suggest no meaningful differences in quality. These insights provide the first characterization of how human-agent collaboration shapes testing practices.
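The abstract reports one effect as an odds ratio (OR=1.79). As a rough illustration of how that measure works, the sketch below computes an odds ratio from the two test-inclusion rates quoted above (42.9% for HAPRs, 40.0% for HPRs). The function names `odds` and `odds_ratio` are hypothetical helpers introduced here, not taken from the paper. The result is not a figure reported by the authors, and the OR=1.79 value concerns a different comparison (adding new tests during co-evolution) whose underlying counts are not given in the abstract, so it is not reproduced.

```python
def odds(p: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Odds of an outcome in group A divided by the odds in group B."""
    return odds(p_a) / odds(p_b)


# Test-inclusion rates as quoted in the abstract.
HAPR_TEST_RATE = 0.429  # human-agent PRs that include tests
HPR_TEST_RATE = 0.400   # human PRs that include tests

# Illustrative only: derived from the two rates, not a value the paper reports.
inclusion_or = odds_ratio(HAPR_TEST_RATE, HPR_TEST_RATE)
print(f"Odds ratio for test inclusion (HAPR vs. HPR): {inclusion_or:.2f}")
```

An odds ratio close to 1 means the two groups have similar odds of the outcome, which matches the abstract's description of the inclusion likelihoods as comparable.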
Related papers
- Code Change Characteristics and Description Alignment: A Comparative Study of Agentic versus Human Pull Requests [0.0]
We analyze 33,596 agent-generated PRs and 6,618 human PRs to compare code-change characteristics and message quality. Agents generate stronger commit-level messages but lag humans at PR-level summarization. These findings highlight a gap between agents' micro-level precision and macro-level communication.
arXiv Detail & Related papers (2026-01-24T23:33:07Z)
- Change And Cover: Last-Mile, Pull Request-Based Regression Test Augmentation [20.31612139450269]
Testing pull requests (PRs) is critical to maintaining software quality. Some PR-modified lines remain untested, leaving a "last-mile" regression test gap. We present ChaCo, an LLM-based test augmentation technique that addresses this gap.
arXiv Detail & Related papers (2026-01-16T02:08:16Z)
- Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests [1.2043574473965317]
We present an empirical study of test inclusion in agentic pull requests using the AIDev dataset. Across agents, test-containing PRs are more common over time and tend to be larger and take longer to complete. We also observe variation across agents in both test adoption and the balance between test and production code within test PRs.
arXiv Detail & Related papers (2026-01-07T03:52:13Z)
- SWE-RM: Execution-free Feedback For Software Engineering Agents [61.86380395896069]
Execution-based feedback is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. We introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference.
arXiv Detail & Related papers (2025-12-26T08:26:18Z)
- TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z)
- On the Mistaken Assumption of Interchangeable Deep Reinforcement Learning Implementations [58.60617136236957]
Deep Reinforcement Learning (DRL) is a paradigm of artificial intelligence where an agent uses a neural network to learn which actions to take in a given environment. DRL has recently gained traction from being able to solve complex environments like driving simulators, 3D robotic control, and multiplayer-online-battle-arena video games. Numerous implementations of the state-of-the-art algorithms responsible for training these agents, like the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms, currently exist.
arXiv Detail & Related papers (2025-03-28T16:25:06Z)
- Precise Error Rates for Computationally Efficient Testing [67.30044609837749]
We revisit the question of simple-versus-simple hypothesis testing with an eye towards computational complexity. An existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates.
arXiv Detail & Related papers (2023-11-01T04:41:16Z)
- Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data.
Our test requires essentially no assumptions on the distributions.
By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.