Studying Quality Improvements Recommended via Manual and Automated Code Review
- URL: http://arxiv.org/abs/2602.11925v1
- Date: Thu, 12 Feb 2026 13:23:43 GMT
- Title: Studying Quality Improvements Recommended via Manual and Automated Code Review
- Authors: Giuseppe Crupi, Rosalia Tufano, Gabriele Bavota
- Abstract summary: We study the similarities and differences between code reviews performed by humans and those automatically generated by Deep Learning models. We show that while ChatGPT tends to recommend a higher number of code changes as compared to human reviewers, it can only spot 10% of the quality issues reported by humans. This finding suggests that, in its current state, DL-based code review can be used as a further quality check on top of the one performed by humans.
- Score: 14.067404766521607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several Deep Learning (DL)-based techniques have been proposed to automate code review. Still, it is unclear to what extent these approaches can recommend quality improvements as a human reviewer would. We study the similarities and differences between code reviews performed by humans and those automatically generated by DL models, using ChatGPT-4 as representative of the latter. In particular, we run a mining-based study in which we collect and manually inspect 739 comments posted by human reviewers to suggest code changes in 240 PRs. The manual inspection aims at classifying the type of quality improvement recommended by human reviewers (e.g., rename variable/constant). Then, we ask ChatGPT to perform a code review on the same PRs and compare the quality improvements it recommends against those suggested by the human reviewers. We show that while, on average, ChatGPT tends to recommend a higher number of code changes than human reviewers (~2.4x more), it can only spot 10% of the quality issues reported by humans. However, ~40% of the additional comments generated by the LLM point to meaningful quality issues. In short, our findings show the complementarity of manual and AI-based code review. They suggest that, in its current state, DL-based code review can be used as a further quality check on top of the one performed by humans, but should not be considered a valid alternative to them nor a means to save code review time, since human reviewers would still need to perform their manual inspection while also validating the quality issues reported by the DL-based technique.
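As a concrete illustration of the study's core step, the sketch below asks an LLM to review a pull-request diff. It assumes the OpenAI Python client and an OPENAI_API_KEY in the environment; the prompt wording, the model name, and the example categories are illustrative choices, not the authors' exact protocol.

```python
# Minimal sketch: ask an LLM to act as a code reviewer on a PR diff.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
# The prompt and category examples are illustrative, not the paper's protocol.
from openai import OpenAI

client = OpenAI()

def review_pr(diff: str) -> str:
    """Request a list of recommended quality improvements for a unified diff."""
    prompt = (
        "You are a code reviewer. For the following pull-request diff, list "
        "each quality improvement you would recommend (e.g., rename "
        "variable/constant, extract method, add missing null check), one per line.\n\n"
        + diff
    )
    response = client.chat.completions.create(
        model="gpt-4",  # the study used ChatGPT-4 as its DL representative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample_diff = (
        "--- a/Foo.java\n+++ b/Foo.java\n"
        "@@ -1 +1 @@\n-int x = 1;\n+int tmp = 1;"
    )
    print(review_pr(sample_diff))
```

The model's recommendations can then be manually matched against the categories assigned to the human comments on the same PRs, which is how the paper derives its overlap and complementarity figures.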
Related papers
- Is Peer Review Really in Decline? Analyzing Review Quality across Venues and Time [55.756345497678204]
We introduce a new framework for evidence-based comparative study of review quality. We apply it to major AI and machine learning conferences: ICLR, NeurIPS and *ACL. We study the relationships between measurements of review quality and its evolution over time.
arXiv Detail & Related papers (2026-01-21T16:48:29Z)
- Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback [75.31379834079648]
We introduce an LLM-driven framework that decomposes reviews into argumentative segments. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.
arXiv Detail & Related papers (2026-01-17T20:32:18Z)
- On Assessing the Relevance of Code Reviews Authored by Generative Models [4.096540146408279]
We propose a novel evaluation approach based on what we call multi-subjective ranking. Using a dataset of 280 self-contained code review requests and corresponding comments from CodeReview StackExchange, multiple human judges ranked the quality of ChatGPT-generated comments alongside the top human responses from the platform. Results show that ChatGPT's comments were ranked significantly better than human ones, even surpassing StackExchange's accepted answers.
arXiv Detail & Related papers (2025-12-17T14:12:31Z)
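A simple way to aggregate such multi-judge rankings is the mean rank per candidate, sketched below. The aggregation rule and the judge and candidate labels are assumptions for illustration; the paper's exact ranking procedure may differ.

```python
# Toy aggregation of "multi-subjective" rankings: several judges each rank
# the same candidate review comments (best first); we report the mean rank.
# Judge/candidate names and the mean-rank rule are illustrative assumptions.
from statistics import mean

rankings = {
    "judge_1": ["chatgpt", "accepted_answer", "other_human"],
    "judge_2": ["chatgpt", "other_human", "accepted_answer"],
    "judge_3": ["accepted_answer", "chatgpt", "other_human"],
}

candidates = {c for order in rankings.values() for c in order}
mean_rank = {
    c: mean(order.index(c) + 1 for order in rankings.values())
    for c in candidates
}
for c, r in sorted(mean_rank.items(), key=lambda kv: kv[1]):
    print(f"{c}: mean rank {r:.2f}")  # lower is better
```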
- CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection [60.52240468810558]
We introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews. We also develop CoCoDet, an AI review detector via a multi-task learning framework, to achieve more accurate and robust detection of AI involvement in review content.
arXiv Detail & Related papers (2025-08-28T06:03:11Z)
- LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews [74.87393214734114]
This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. Instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 points.
arXiv Detail & Related papers (2025-04-15T10:07:33Z)
- Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? [14.970843824847956]
We run a controlled experiment with 29 experts who reviewed different programs with/without the support of an automatically generated code review. We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior. Reviewers who started from an automated review identified more low-severity issues, but not more high-severity issues, than those following a completely manual process.
arXiv Detail & Related papers (2024-11-18T09:24:01Z)
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells [15.66562304661042]
CRScore is a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment among open source metrics. We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
arXiv Detail & Related papers (2024-09-29T21:53:18Z)
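In the spirit of a reference-free metric, the toy function below scores a review comment by how many issues ("smells") detected in the code it touches on. The hard-coded smell list and the word-overlap rule are simplifying assumptions, not CRScore's actual pipeline.

```python
# Toy reference-free review score: fraction of detected code smells that
# the review comment mentions. Word overlap is a crude stand-in for the
# claim/smell grounding that CRScore actually performs.
import re

def coverage_score(review: str, smells: list[str]) -> float:
    """Fraction of smells sharing at least one word with the review."""
    review_words = set(re.findall(r"[a-z]+", review.lower()))
    covered = sum(
        1 for smell in smells
        if set(smell.lower().split()) & review_words
    )
    return covered / len(smells) if smells else 0.0

smells = ["long method", "magic number", "duplicated code"]
review = "Please extract this long method and give the magic number a name."
print(f"{coverage_score(review, smells):.2f}")  # 0.67: two of three smells covered
```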
- Leveraging Reviewer Experience in Code Review Comment Generation [11.224317228559038]
We train deep learning models to imitate human reviewers in providing natural language code reviews. The quality of the model-generated reviews remains sub-optimal due to the quality of the open-source code review data used in model training. We propose a suite of experience-aware training methods that utilise the reviewers' past authoring and reviewing experiences as signals for review quality.
arXiv Detail & Related papers (2024-09-17T07:52:50Z)
- Improving Automated Code Reviews: Learning from Experience [12.573740138977065]
This study investigates whether higher-quality reviews can be generated from automated code review models.
We find that experience-aware oversampling can increase the correctness, level of information, and meaningfulness of reviews.
arXiv Detail & Related papers (2024-02-06T07:48:22Z)
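One way to realize experience-aware oversampling is to duplicate training examples in proportion to their reviewer's experience, as sketched below. The `experience` field and the linear duplication rule are illustrative assumptions; the paper's exact weighting scheme may differ.

```python
# Hedged sketch of experience-aware oversampling: examples from more
# experienced reviewers are repeated so they weigh more during training.
import math

def oversample(examples: list[dict], max_copies: int = 3) -> list[dict]:
    """Duplicate each example 1..max_copies times, scaled by reviewer experience."""
    top = max(e["experience"] for e in examples)
    out: list[dict] = []
    for e in examples:
        copies = 1 + math.floor((max_copies - 1) * e["experience"] / top)
        out.extend([e] * copies)
    return out

data = [
    {"review": "Rename this variable.", "experience": 1},
    {"review": "Extract this method; it mixes two concerns.", "experience": 10},
]
print(len(oversample(data)))  # 4: one copy of the junior review, three of the senior
```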
- Deep Just-In-Time Inconsistency Detection Between Comments and Source Code [51.00904399653609]
In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code.
We develop a deep-learning approach that learns to correlate a comment with code changes.
We show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system.
arXiv Detail & Related papers (2020-10-04T16:49:28Z)
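A crude lexical proxy for this correlation is shown below: flag a comment as suspicious when its similarity to the code drops sharply after a change. The paper trains a deep model for this; the TF-IDF cosine similarity and the 0.5-drop threshold are simplifying assumptions for illustration only.

```python
# Toy just-in-time inconsistency check: compare a comment's lexical
# similarity to the code before and after a change. Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(comment: str, code: str) -> float:
    # Letters-only tokens so `get_user_count` matches "user count".
    vecs = TfidfVectorizer(token_pattern=r"[A-Za-z]+").fit_transform([comment, code])
    return float(cosine_similarity(vecs[0], vecs[1])[0, 0])

comment = "Return the user count."
old_code = "def get_user_count(): return len(users)"
new_code = "def get_active_emails(): return [p.email for p in people if p.active]"

before, after = similarity(comment, old_code), similarity(comment, new_code)
# A comment that tracked the old code but not the new one has likely
# become inconsistent as a result of the change.
print(f"before={before:.2f}, after={after:.2f}, inconsistent={after < 0.5 * before}")
```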