Code Review Automation: Strengths and Weaknesses of the State of the Art
- URL: http://arxiv.org/abs/2401.05136v1
- Date: Wed, 10 Jan 2024 13:00:18 GMT
- Title: Code Review Automation: Strengths and Weaknesses of the State of the Art
- Authors: Rosalia Tufano, Ozren Dabić, Antonio Mastropaolo, Matteo Ciniselli,
and Gabriele Bavota
- Abstract summary: This paper characterizes the cases in which three code review automation techniques tend to succeed or fail in two code review tasks.
The study has a strong qualitative focus, with ~105 man-hours of manual inspection invested in analyzing correct and wrong predictions.
- Score: 14.313783664862923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The automation of code review has been tackled by several researchers with
the goal of reducing its cost. The adoption of deep learning in software
engineering pushed the automation to new boundaries, with techniques imitating
developers in generative tasks, such as commenting on a code change as a
reviewer would do or addressing a reviewer's comment by modifying code. The
performance of these techniques is usually assessed through quantitative
metrics, e.g., the percentage of instances in the test set for which correct
predictions are generated, leaving many open questions on the techniques'
capabilities. For example, knowing that an approach is able to correctly
address a reviewer's comment in 10% of cases is of little value without knowing
what was asked by the reviewer: What if in all successful cases the code change
required to address the comment was just the removal of an empty line? In this
paper we aim at characterizing the cases in which three code review automation
techniques tend to succeed or fail in the two above-described tasks. The study
has a strong qualitative focus, with ~105 man-hours of manual inspection
invested in analyzing correct and wrong predictions generated by the
three techniques, for a total of 2,291 inspected predictions. The outputs of
this analysis are two taxonomies reporting, for each of the two tasks, the
types of code changes on which the experimented techniques tend to succeed or
to fail, pointing to areas for future work. A result of our manual analysis was
also the identification of several issues in the datasets used to train and
test the experimented techniques. Finally, we assess the importance of
researching techniques specialized for code review automation by comparing
their performance with ChatGPT, a general purpose large language model, finding
that ChatGPT struggles to comment on code as a human reviewer would.
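
The abstract's core critique concerns quantitative metrics such as the percentage of test instances for which a correct prediction is generated. A minimal sketch of that kind of metric (exact-match rate; the whitespace normalization is an assumption for illustration, not the papers' exact procedure):

```python
# Illustrative exact-match metric: the share of test instances for which the
# generated prediction matches the reference, ignoring formatting-only differences.

def normalize(code: str) -> str:
    """Collapse whitespace so formatting-only differences do not count."""
    return " ".join(code.split())

def exact_match_rate(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference after normalization."""
    assert len(predictions) == len(references)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references) if references else 0.0

# Toy example: 1 of 3 predictions is an exact match, so the metric reports 33.33%.
preds = ["return x + 1", "if x is None: return", "print('done')"]
refs  = ["return x + 1", "if x == None:\n    return None", "log.info('done')"]
print(f"{exact_match_rate(preds, refs):.2%}")
```

As the abstract argues, this single number says nothing about whether the matched cases were trivial (e.g., removing an empty line) or genuinely hard.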
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- Predicting Expert Evaluations in Software Code Reviews [8.012861163935904]
This paper presents an algorithmic model that automates aspects of code review typically avoided due to their complexity or subjectivity.
Instead of replacing manual reviews, our model adds insights that help reviewers focus on more impactful tasks.
arXiv Detail & Related papers (2024-09-23T16:01:52Z)
- Leveraging Reviewer Experience in Code Review Comment Generation [11.224317228559038]
We train deep learning models to imitate human reviewers in providing natural language code reviews.
The quality of the model-generated reviews remains sub-optimal due to the quality of the open-source code review data used in model training.
We propose a suite of experience-aware training methods that utilise the reviewers' past authoring and reviewing experiences as signals for review quality.
arXiv Detail & Related papers (2024-09-17T07:52:50Z)
- An Empirical Study on Code Review Activity Prediction and Its Impact in Practice [7.189276599254809]
This paper aims to help code reviewers by predicting which files in a submitted patch (1) need to be commented, (2) need to be revised, or (3) are hot-spots (commented or revised).
Our empirical study on three open-source and two industrial datasets shows that combining the code embedding and review process features leads to better results than the state-of-the-art approach.
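
A minimal sketch of the "code embedding plus review process features" idea described above, assuming synthetic data, hypothetical process features, and an off-the-shelf logistic regression rather than the paper's actual pipeline:

```python
# Combine a learned embedding of the changed file with hand-crafted process
# features, then train a classifier to predict whether the file will be
# commented or revised. All names and values here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_files, embed_dim = 200, 64
code_embeddings = rng.normal(size=(n_files, embed_dim))   # e.g., vectors from a code encoder
process_features = rng.normal(size=(n_files, 3))          # e.g., author experience, churn, prior reviews
labels = rng.integers(0, 2, size=n_files)                 # 1 = file was commented/revised

X = np.hstack([code_embeddings, process_features])        # concatenate both feature groups
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy (synthetic data, illustrative only):", clf.score(X, labels))
```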
arXiv Detail & Related papers (2024-04-16T16:20:02Z)
- Improving Automated Code Reviews: Learning from Experience [12.573740138977065]
This study investigates whether higher-quality reviews can be generated from automated code review models.
We find that experience-aware oversampling can increase the correctness, level of information, and meaningfulness of reviews.
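
A minimal sketch of what experience-aware oversampling could look like; the experience measure and the duplication rule below are assumptions for illustration, not the paper's exact setup:

```python
# Reviews written by more experienced reviewers are duplicated more often in the
# training set, so the model sees presumably higher-quality reviews more frequently.
import math
import random

def oversample_by_experience(examples, max_copies=4, seed=0):
    """examples: dicts with 'diff', 'review', and 'reviewer_experience'
    (e.g., number of prior reviews). Returns an oversampled, shuffled list."""
    out = []
    for ex in examples:
        # one copy for novices, up to max_copies for very experienced reviewers
        copies = min(max_copies, 1 + int(math.log1p(ex["reviewer_experience"])))
        out.extend([ex] * copies)
    random.Random(seed).shuffle(out)
    return out

train = [
    {"diff": "+ return x", "review": "Handle None too.", "reviewer_experience": 120},
    {"diff": "+ i += 1",   "review": "Looks fine.",      "reviewer_experience": 2},
]
print(len(oversample_by_experience(train)))  # the experienced reviewer's example appears more often
```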
arXiv Detail & Related papers (2024-02-06T07:48:22Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases insignificant changes in input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions [54.55334589363247]
We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits.
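
A minimal sketch of the highlighting step described above; here a low generation probability is used as a naive proxy score, whereas the paper's point is that a dedicated edit-likelihood model works better than raw generation probabilities:

```python
# Rank completion tokens by how likely they are to need an edit and mark the
# top-k for the editor to highlight.

def tokens_to_highlight(tokens, edit_scores, k=2):
    """tokens: completion tokens; edit_scores: per-token likelihood of being edited.
    Returns the indices of the k tokens most worth highlighting, in order."""
    ranked = sorted(range(len(tokens)), key=lambda i: edit_scores[i], reverse=True)
    return sorted(ranked[:k])

completion  = ["open(", "path", ",", "'w'", ")"]
gen_probs   = [0.98, 0.41, 0.99, 0.55, 0.97]   # model confidence per token
edit_scores = [1 - p for p in gen_probs]       # naive proxy for "will be edited"
for i in tokens_to_highlight(completion, edit_scores):
    print(f"highlight token {i}: {completion[i]!r}")
```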
arXiv Detail & Related papers (2023-02-14T18:43:34Z)
- CodeReviewer: Pre-Training for Automating Code Review Activities [36.40557768557425]
This research focuses on utilizing pre-training techniques for the tasks in the code review scenario.
We collect a large-scale dataset of real world code changes and code reviews from open-source projects in nine of the most popular programming languages.
To better understand code diffs and reviews, we propose CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario.
arXiv Detail & Related papers (2022-03-17T05:40:13Z)
- ProtoTransformer: A Meta-Learning Approach to Providing Student Feedback [54.142719510638614]
In this paper, we frame the problem of providing feedback as few-shot classification.
A meta-learner adapts to give feedback to student code on a new programming question from just a few examples provided by instructors.
Our approach was successfully deployed to deliver feedback to 16,000 student exam-solutions in a programming course offered by a tier 1 university.
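
A prototype-style sketch of the few-shot setup described above: embed the few instructor-labelled example solutions, average them into one prototype per feedback class, and assign a new student solution to the nearest prototype. The toy hash-based embedding and the nearest-prototype rule are illustrative assumptions; the actual approach meta-learns the encoder.

```python
import hashlib
import numpy as np

def embed(code: str, dim: int = 32) -> np.ndarray:
    """Toy deterministic embedding (hash-seeded random vector); a learned encoder would go here."""
    seed = int.from_bytes(hashlib.md5(code.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=dim)

def classify(student_code: str, support: dict) -> str:
    """support: feedback label -> list of instructor-provided example solutions."""
    prototypes = {label: np.mean([embed(c) for c in codes], axis=0)
                  for label, codes in support.items()}
    query = embed(student_code)
    return min(prototypes, key=lambda lbl: np.linalg.norm(query - prototypes[lbl]))

support = {
    "off-by-one in loop bound": ["for i in range(n+1): total += i"],
    "missing base case":        ["def fact(n): return n * fact(n-1)"],
}
# The query equals one support example, so it lands on that class's prototype.
print(classify("def fact(n): return n * fact(n-1)", support))  # "missing base case"
```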
arXiv Detail & Related papers (2021-07-23T22:41:28Z)
- Hierarchical Bi-Directional Self-Attention Networks for Paper Review Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: a sentence encoder (level one), an intra-review encoder (level two), and an inter-review encoder (level three).
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
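
A minimal sketch of the three-level hierarchy described above, with mean-pooling standing in for the paper's bi-directional self-attention encoders; the toy sentence encoder and the untrained linear scorer are assumptions for illustration only:

```python
# Level 1 encodes sentences, level 2 pools sentences into one vector per review,
# level 3 pools review vectors into one paper representation used to score a rating.
import hashlib
import numpy as np

DIM = 16

def encode_sentence(sentence: str) -> np.ndarray:          # level 1 (toy encoder)
    seed = int.from_bytes(hashlib.md5(sentence.encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=DIM)

def encode_review(sentences: list[str]) -> np.ndarray:     # level 2 (intra-review)
    return np.mean([encode_sentence(s) for s in sentences], axis=0)

def encode_paper(reviews: list[list[str]]) -> np.ndarray:  # level 3 (inter-review)
    return np.mean([encode_review(r) for r in reviews], axis=0)

reviews = [
    ["The method is novel.", "Experiments are thorough."],
    ["Writing is unclear in places.", "Results support the claims."],
]
w = np.random.default_rng(0).normal(size=DIM)               # untrained linear scorer
print(f"predicted (untrained) rating score: {float(encode_paper(reviews) @ w):.3f}")
```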
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
- Deep Just-In-Time Inconsistency Detection Between Comments and Source Code [51.00904399653609]
In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code.
We develop a deep-learning approach that learns to correlate a comment with code changes.
We show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system.
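
A minimal sketch of the just-in-time idea described above, using a crude lexical heuristic in place of the paper's learned model: if identifiers mentioned in a comment disappear from the code after a change, the comment is flagged as possibly inconsistent.

```python
import re

def identifiers(text: str) -> set[str]:
    """Extract identifier-like tokens from comment or code text."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))

def possibly_inconsistent(comment: str, old_code: str, new_code: str) -> bool:
    mentioned = identifiers(comment)
    removed = identifiers(old_code) - identifiers(new_code)
    return bool(mentioned & removed)   # the comment refers to something the change removed

comment  = "# Retries the request up to max_retries times."
old_code = "def fetch(url, max_retries=3): ..."
new_code = "def fetch(url, timeout=30): ..."
print(possibly_inconsistent(comment, old_code, new_code))  # True: max_retries was removed
```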
arXiv Detail & Related papers (2020-10-04T16:49:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.