Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study
- URL: http://arxiv.org/abs/2411.07091v1
- Date: Mon, 11 Nov 2024 16:12:11 GMT
- Title: Impact of LLM-based Review Comment Generation in Practice: A Mixed Open-/Closed-source User Study
- Authors: Doriane Olewicki, Leuson Da Silva, Suhaib Mujahid, Arezou Amini, Benjamin Mah, Marco Castelluccio, Sarra Habchi, Foutse Khomh, Bram Adams
- Abstract summary: This user study was performed in two organizations, Mozilla and Ubisoft.
We observed that 8.1% and 7.2% of LLM-generated comments were accepted by reviewers at Mozilla and Ubisoft, respectively.
Refactoring-related comments are more likely to be accepted than Functional comments.
- Score: 13.650356901064807
- License:
- Abstract: We conduct a large-scale empirical user study in a live setup to evaluate the acceptance of LLM-generated comments and their impact on the review process. This user study was performed in two organizations, Mozilla (which has its codebase available as open source) and Ubisoft (fully closed-source). Inside their usual review environment, participants were given access to RevMate, an LLM-based assistive tool suggesting generated review comments using an off-the-shelf LLM with Retrieval Augmented Generation to provide extra code and review context, combined with LLM-as-a-Judge, to auto-evaluate the generated comments and discard irrelevant cases. Based on more than 587 patch reviews provided by RevMate, we observed that 8.1% and 7.2%, respectively, of LLM-generated comments were accepted by reviewers in each organization, while 14.6% and 20.5% other comments were still marked as valuable as review or development tips. Refactoring-related comments are more likely to be accepted than Functional comments (18.2% and 18.6% compared to 4.8% and 5.2%). The extra time spent by reviewers to inspect generated comments or edit accepted ones (36/119), yielding an overall median of 43s per patch, is reasonable. The accepted generated comments are as likely to yield future revisions of the revised patch as human-written comments (74% vs 73% at chunk-level).
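To make the described setup concrete, the following is a minimal sketch of a RevMate-like pipeline, assuming a caller-supplied chat-completion function. The keyword-overlap retriever and the relevance prompt are illustrative stand-ins for the paper's RAG and LLM-as-a-Judge components, not their actual implementation.

```python
# Illustrative sketch of a RevMate-like flow (not the paper's code):
# 1) retrieve extra code/review context, 2) generate candidate comments,
# 3) apply an LLM-as-a-Judge pass to discard irrelevant candidates.
from typing import Callable, List

def retrieve_context(patch: str, corpus: List[str], k: int = 3) -> List[str]:
    """Naive keyword-overlap retriever standing in for the RAG step."""
    patch_tokens = set(patch.lower().split())
    ranked = sorted(corpus, key=lambda doc: -len(patch_tokens & set(doc.lower().split())))
    return ranked[:k]

def generate_comments(patch: str, context: List[str], llm: Callable[[str], str]) -> List[str]:
    prompt = (
        "You are a code reviewer. Given the patch and related context, "
        "suggest review comments, one per line.\n\n"
        f"Context:\n{chr(10).join(context)}\n\nPatch:\n{patch}\n"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def judge_comment(patch: str, comment: str, llm: Callable[[str], str]) -> bool:
    """LLM-as-a-Judge filter: keep only comments the judge deems relevant."""
    verdict = llm(
        "Answer RELEVANT or IRRELEVANT. Is this review comment useful "
        f"for the patch below?\n\nComment: {comment}\n\nPatch:\n{patch}\n"
    )
    return "IRRELEVANT" not in verdict.upper()

def review(patch: str, corpus: List[str], llm: Callable[[str], str]) -> List[str]:
    context = retrieve_context(patch, corpus)
    candidates = generate_comments(patch, context, llm)
    return [c for c in candidates if judge_comment(patch, c, llm)]
```

In practice the `llm` callable would wrap whatever off-the-shelf model and review environment the organization already uses.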
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement [0.7285835869818668]
We propose a large language model (LLM)-based Competitive User Review Analysis for Feature Enhancement.
LLM-Cure identifies and categorizes features within reviews by applying LLMs.
When provided with a complaint in a user review, LLM-Cure curates highly rated (4 and 5 stars) reviews in competing apps related to the complaint.
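A minimal sketch of that curation step, assuming reviews arrive as rating/text records; relevance is approximated by keyword overlap here, whereas LLM-Cure applies LLMs for identification and categorization, so everything below is purely illustrative.

```python
# Hypothetical sketch of the curation step described for LLM-Cure:
# given a complaint, keep highly rated (4-5 star) competitor reviews
# that appear to address the same feature.
from typing import Dict, List

def curate(complaint: str, competitor_reviews: List[Dict], min_rating: int = 4, k: int = 5) -> List[Dict]:
    complaint_terms = set(complaint.lower().split())
    candidates = [r for r in competitor_reviews if r["rating"] >= min_rating]
    candidates.sort(key=lambda r: -len(complaint_terms & set(r["text"].lower().split())))
    return candidates[:k]

reviews = [
    {"app": "CompetitorA", "rating": 5, "text": "Dark mode works great and saves battery"},
    {"app": "CompetitorB", "rating": 2, "text": "Crashes on startup"},
]
print(curate("Please add a dark mode option", reviews))
```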
arXiv Detail & Related papers (2024-09-24T04:17:21Z)
- AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences based on pairwise comparisons.
We fine-tune an LLM to predict which reviews humans will prefer in a head-to-head battle between LLMs.
We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
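An arena of pairwise comparisons is typically tallied with an Elo-style update, sketched below with made-up battle outcomes; the paper's fine-tuned preference predictor is not reproduced here.

```python
# Illustrative Elo-style tally for pairwise review "battles".
# Model names and outcomes are fabricated; ratings start at 1000.
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32):
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)

battles = [("reviewer-a", "reviewer-b"), ("reviewer-b", "reviewer-c")]
ratings = defaultdict(lambda: 1000.0)
for winner, loser in battles:
    update_elo(ratings, winner, loser)
print(dict(ratings))
```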
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
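A minimal sketch of the decompose-then-aggregate idea: score each sub-criterion separately, then combine the scores with explicit weights. The criteria, weights, and stub scorer are assumptions for illustration; in the paper each stage would be handled by an LLM.

```python
# Decompose-then-aggregate evaluation sketch: per-criterion scores are
# produced independently (here by a stub) and combined with weights.
from typing import Callable, Dict

def decompose_and_aggregate(text: str, criteria: Dict[str, float],
                            score_fn: Callable[[str, str], float]) -> float:
    per_criterion = {name: score_fn(text, name) for name in criteria}
    total_weight = sum(criteria.values())
    return sum(criteria[name] * score for name, score in per_criterion.items()) / total_weight

def stub_score(text: str, criterion: str) -> float:
    # Placeholder for an LLM call that rates `text` on `criterion` (1-5).
    return min(5.0, 1.0 + len(text) % 5)

weights = {"fluency": 0.2, "relevance": 0.5, "consistency": 0.3}
print(decompose_and_aggregate("example summary to evaluate", weights, stub_score))
```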
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- An Empirical Study on Code Review Activity Prediction and Its Impact in Practice [7.189276599254809]
This paper aims to help code reviewers by predicting which files in a submitted patch (1) need to be commented, (2) need to be revised, or (3) are hot-spots (commented or revised).
Our empirical study on three open-source and two industrial datasets shows that combining the code embedding and review process features leads to better results than the state-of-the-art approach.
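A sketch of that kind of feature combination, assuming precomputed code embeddings and a handful of review-process features; the synthetic data and the logistic-regression classifier are illustrative choices, not the study's setup.

```python
# Combining a code-change embedding with review-process features
# (e.g., prior comment counts, number of revisions) before a classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
code_embeddings = rng.normal(size=(200, 64))   # per-file embedding of the change
process_features = rng.normal(size=(200, 5))   # e.g., prior comments, revisions, churn
labels = rng.integers(0, 2, size=200)          # 1 = file ended up commented/revised

features = np.concatenate([code_embeddings, process_features], axis=1)
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print("training accuracy:", clf.score(features, labels))
```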
arXiv Detail & Related papers (2024-04-16T16:20:02Z)
- MARG: Multi-Agent Review Generation for Scientific Papers [28.78019426139167]
We develop MARG, a feedback generation approach using multiple LLM instances that engage in internal discussion.
By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM.
In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time.
Our system substantially improves the ability of GPT-4 to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and generating 3.7 good comments per paper (a 2.2x improvement).
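A sketch of the distribution idea: split the full paper text into chunks that fit a context window and let one agent comment per chunk. The inter-agent discussion MARG also performs is omitted, and the stub agent stands in for a real LLM call.

```python
# Splitting a long paper across per-chunk agents to sidestep the base
# LLM's input-length limit; chunk size and prompt wording are illustrative.
from typing import Callable, List

def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def multi_agent_feedback(paper_text: str, agent: Callable[[str], str]) -> List[str]:
    return [agent(f"Review this part of a paper and list concrete issues:\n{chunk}")
            for chunk in chunk_text(paper_text)]

# Trivial stand-in agent just to show the plumbing:
print(multi_agent_feedback("x" * 9000, lambda prompt: f"{len(prompt)} chars reviewed"))
```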
arXiv Detail & Related papers (2024-01-08T22:24:17Z)
- Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
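The reported alignment is an ordinary Pearson correlation between model-assigned and human-assigned rubric scores; the sketch below shows the computation on made-up score lists.

```python
# Pearson correlation between model and human rubric scores.
# The score values are fabricated purely to demonstrate the statistic.
import numpy as np

model_scores = np.array([4, 3, 5, 2, 4, 1], dtype=float)
human_scores = np.array([5, 3, 4, 2, 4, 1], dtype=float)

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    x_c, y_c = x - x.mean(), y - y.mean()
    return float((x_c @ y_c) / np.sqrt((x_c @ x_c) * (y_c @ y_c)))

print(pearson(model_scores, human_scores))
```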
arXiv Detail & Related papers (2023-10-12T16:50:08Z)
- A Closer Look into Automatic Evaluation Using Large Language Models [75.49360351036773]
We discuss how details in the evaluation process change how well the ratings given by LLMs correlate with human ratings.
We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings.
We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal.
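One alternative to forcing a single numeric output is to take an expectation over the probabilities the model assigns to each rating token (a weighting along the lines of what G-Eval describes). The probabilities below are fabricated purely to show the arithmetic.

```python
# Probability-weighted rating instead of a single forced numeric token.
# Token probabilities here are made up for illustration.
rating_token_probs = {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.40, 5: 0.20}

expected_rating = sum(score * p for score, p in rating_token_probs.items())
print(round(expected_rating, 3))  # 3.6 rather than a hard "4"
```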
arXiv Detail & Related papers (2023-10-09T12:12:55Z)
- Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z)
- ReAct: A Review Comment Dataset for Actionability (and more) [0.8885727065823155]
We introduce an annotated review comment dataset ReAct.
The review comments are sourced from OpenReview site.
We crowd-source annotations for these reviews for actionability and type of comments.
arXiv Detail & Related papers (2022-10-02T07:09:38Z)
- Automating App Review Response Generation [67.58267006314415]
We propose a novel approach RRGen that automatically generates review responses by learning knowledge relations between reviews and their responses.
Experiments on 58 apps and 309,246 review-response pairs highlight that RRGen outperforms the baselines by at least 67.4% in terms of BLEU-4.
arXiv Detail & Related papers (2020-02-10T05:23:38Z)
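BLEU-4, the metric used to compare RRGen against its baselines, measures 1- to 4-gram overlap between a generated response and a reference. A toy computation follows, assuming NLTK is installed; the sentences are invented.

```python
# BLEU-4 between a generated review response and a reference response.
# Example sentences are fabricated; only the metric is from the paper.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["thanks", "for", "the", "feedback", "we", "will", "fix", "the", "crash"]
candidate = ["thanks", "for", "the", "feedback", "we", "are", "fixing", "the", "crash"]

score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```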
This list is automatically generated from the titles and abstracts of the papers on this site.