The Impact of Large Language Models (LLMs) on Code Review Process
- URL: http://arxiv.org/abs/2508.11034v1
- Date: Thu, 14 Aug 2025 19:39:01 GMT
- Title: The Impact of Large Language Models (LLMs) on Code Review Process
- Authors: Antonio Collante, Samuel Abedu, SayedHassan Khatoonabadi, Ahmad Abdellatif, Ebube Alor, Emad Shihab,
- Abstract summary: This research investigates the effect of GPT on GitHub pull request (PR)<n>We curated a dataset of 25,473 PRs from 9,254 GitHub projects.<n>We identified GPT-assisted PRs using a semi-automated approach that combines keyword-based detection, regular expression filtering, and manual verification.
- Score: 2.8071068465772853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have recently gained prominence in the field of software development, significantly boosting productivity and simplifying teamwork. Although prior studies have examined task-specific applications, the phase-specific effects of LLM assistance on the efficiency of code review processes remain underexplored. This research investigates the effect of GPT on GitHub pull request (PR) workflows, with a focus on reducing resolution time, optimizing phase-specific performance, and assisting developers. We curated a dataset of 25,473 PRs from 9,254 GitHub projects and identified GPT-assisted PRs using a semi-automated heuristic approach that combines keyword-based detection, regular expression filtering, and manual verification until achieving 95% labeling accuracy. We then applied statistical modeling, including multiple linear regression and Mann-Whitney U test, to evaluate differences between GPT-assisted and non-assisted PRs, both at the overall resolution level and across distinct review phases. Our research has revealed that early adoption of GPT can substantially boost the effectiveness of the PR process, leading to considerable time savings at various stages. Our findings suggest that GPT-assisted PRs reduced median resolution time by more than 60% (9 hours compared to 23 hours for non-assisted PRs). We discovered that utilizing GPT can reduce the review time by 33% and the waiting time before acceptance by 87%. Analyzing a sample dataset of 300 GPT-assisted PRs, we discovered that developers predominantly use GPT for code optimization (60%), bug fixing (26%), and documentation updates (12%). This research sheds light on the impact of the GPT model on the code review process, offering actionable insights for software teams seeking to enhance workflows and promote seamless collaboration.
Related papers
- A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models [53.31664844941449]
ProActive Self-Refinement (PASR) is a novel method for improving large language models (LLMs)<n>Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context.<n>We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR.
arXiv Detail & Related papers (2025-08-18T13:07:21Z) - DeputyDev -- AI Powered Developer Assistant: Breaking the Code Review Logjam through Contextual AI to Boost Developer Productivity [38.585498338645856]
This study investigates the implementation and efficacy of DeputyDev.<n>DeputyDev is an AI-powered code review assistant developed to address inefficiencies in the software development process.
arXiv Detail & Related papers (2025-08-13T10:09:45Z) - PatchTrack: A Comprehensive Analysis of ChatGPT's Influence on Pull Request Outcomes [0.0]
We analyze pull requests from 255 GitHub repositories containing self-admitted ChatGPT usage.<n>We introduce PatchTrack, a tool that classifies whether ChatGPT patches were applied, not applied, or not suggested.<n>A qualitative analysis of 89 pull requests with integrated patches revealed recurring patterns of structural integration, selective extraction, and iterative refinement.
arXiv Detail & Related papers (2025-05-12T16:09:33Z) - Exploring Reliable PPG Authentication on Smartwatches in Daily Scenarios [28.98973058936974]
Photoplethysmography ( PPG) sensors are widely deployed in smartwatches and offer a simple and non-invasive authentication approach for daily use.<n>We propose MTL-RAPID, an efficient and reliable PPG authentication model, simultaneously assessing signal quality and verifying user identity.<n>In our comprehensive user studies regarding motion artifacts, time variations, and user preferences, MTL-RAPID achieves a best AUC of 99.2% and an EER of 3.5%, outperforming existing baselines.
arXiv Detail & Related papers (2025-03-31T10:25:48Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores the correlation evolvement between prompts and patch tokens during proficient training.
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z) - A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair [19.123640635549524]
Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of software engineering tasks.
This paper reviews the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives.
ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% prediction accuracy.
arXiv Detail & Related papers (2023-10-13T06:11:47Z) - Large Language Models (GPT) for automating feedback on programming
assignments [0.0]
We employ OpenAI's GPT-3.5 model to generate personalized hints for students solving programming assignments.
Students rated the usefulness of GPT-generated hints positively.
arXiv Detail & Related papers (2023-06-30T21:57:40Z) - Is Self-Repair a Silver Bullet for Code Generation? [68.02601393906083]
Large language models have shown remarkable aptitude in code generation, but still struggle to perform complex tasks.
Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance.
We analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS.
arXiv Detail & Related papers (2023-06-16T15:13:17Z) - Progressive-Hint Prompting Improves Reasoning in Large Language Models [63.98629132836499]
This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP)
It enables automatic multiple interactions between users and Large Language Models (LLMs) by using previously generated answers as hints to progressively guide toward the correct answers.
We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient.
arXiv Detail & Related papers (2023-04-19T16:29:48Z) - Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z) - What Makes Good In-Context Examples for GPT-$3$? [101.99751777056314]
GPT-$3$ has attracted lots of attention due to its superior performance across a wide range of NLP tasks.
Despite its success, we found that the empirical results of GPT-$3$ depend heavily on the choice of in-context examples.
In this work, we investigate whether there are more effective strategies for judiciously selecting in-context examples.
arXiv Detail & Related papers (2021-01-17T23:38:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.