CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells
- URL: http://arxiv.org/abs/2409.19801v1
- Date: Sun, 29 Sep 2024 21:53:18 GMT
- Title: CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells
- Authors: Atharva Naik, Marcus Alenius, Daniel Fried, Carolyn Rose
- Abstract summary: We develop CRScore, a reference-free metric that measures dimensions of review quality such as conciseness, comprehensiveness, and relevance.
We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment.
We also release a corpus of 2.6k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
- Score: 15.66562304661042
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change (also called a diff), even though code review is a one-to-many problem, like generation and summarization, with many "valid reviews" for a diff. To tackle these issues, we develop CRScore, a reference-free metric that measures dimensions of review quality like conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.6k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
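The abstract does not give CRScore's exact scoring formulas, but the grounding idea can be illustrated with a small sketch: match review sentences against the claims and smells detected for a diff, then derive precision/recall-style scores for conciseness and comprehensiveness. The similarity function and threshold below are placeholders, not the authors' implementation.

```python
import re


def similarity(a: str, b: str) -> float:
    """Placeholder similarity: token-level Jaccard overlap.
    A real system would use a sentence-embedding model instead."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def score_review(review_sentences, claims, threshold=0.3):
    """Score a review against the claims/smells detected for a diff.

    conciseness       ~ fraction of review sentences that match some claim
    comprehensiveness ~ fraction of claims covered by the review
    relevance         ~ harmonic mean of the two
    """
    matched = [s for s in review_sentences
               if any(similarity(s, c) >= threshold for c in claims)]
    covered = [c for c in claims
               if any(similarity(s, c) >= threshold for s in review_sentences)]
    conciseness = len(matched) / len(review_sentences) if review_sentences else 0.0
    comprehensiveness = len(covered) / len(claims) if claims else 0.0
    denom = conciseness + comprehensiveness
    relevance = 2 * conciseness * comprehensiveness / denom if denom else 0.0
    return {"conciseness": conciseness,
            "comprehensiveness": comprehensiveness,
            "relevance": relevance}


# Example with made-up claims and review sentences.
claims = ["the loop may read past the end of the list",
          "the variable name tmp is unclear"]
review = ["The variable name tmp is unclear, please rename it.",
          "Looks good to me."]
print(score_review(review, claims))
```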
Related papers
- Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity [27.92468098611616]
We propose two novel semantic-based approaches for assessing code reviews.
The first approach involves converting both the generated review and its reference into vector representations using a deep learning model (see the sketch below).
The second approach generates a prompt based on the generated review and its reference, submits this prompt to ChatGPT, and requests ChatGPT to rate the generated review.
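As a rough illustration of the first, embedding-based approach (the summary does not name the encoder, so the vectors below are made-up placeholders), the two reviews can be compared with a cosine similarity:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


generated_review_vec = [0.12, 0.80, 0.31]  # hypothetical embedding of the generated review
reference_review_vec = [0.10, 0.75, 0.40]  # hypothetical embedding of the reference review
print(cosine_similarity(generated_review_vec, reference_review_vec))
```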
arXiv Detail & Related papers (2025-01-09T11:52:32Z) - Can Large Language Models Serve as Evaluators for Code Summarization? [47.21347974031545]
The paper finds that Large Language Models (LLMs) can serve as effective evaluators of code summarization methods.
The proposed approach, CODERPE, prompts LLMs to play diverse roles, such as code reviewer, code author, code editor, and system analyst.
CODERPE achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.
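The Spearman correlation reported above is a standard rank correlation between automatic and human scores; a minimal sketch of how such an agreement number is computed (the scores are made-up placeholders, not data from the paper):

```python
from scipy.stats import spearmanr

metric_scores = [0.91, 0.40, 0.73, 0.15, 0.66]  # hypothetical automatic scores
human_scores = [5, 2, 4, 1, 3]                  # hypothetical human ratings

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```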
arXiv Detail & Related papers (2024-12-02T09:56:18Z) - Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? [14.970843824847956]
We run a controlled experiment with 29 experts who reviewed different programs with/without the support of an automatically generated code review.
We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior.
Reviewers who started from an automated review identified more low-severity issues, but did not identify more high-severity issues than those following a completely manual process.
arXiv Detail & Related papers (2024-11-18T09:24:01Z) - Predicting Expert Evaluations in Software Code Reviews [8.012861163935904]
This paper presents an algorithmic model that automates aspects of code review typically avoided due to their complexity or subjectivity.
Instead of replacing manual reviews, our model adds insights that help reviewers focus on more impactful tasks.
arXiv Detail & Related papers (2024-09-23T16:01:52Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - Improving Code Reviewer Recommendation: Accuracy, Latency, Workload, and Bystanders [6.538051328482194]
We build upon RevRecV1, the recommender that has been in production since 2018.
We find that reviewers were being assigned based on prior authorship of files.
Having an individual who is responsible for the review reduces the time taken for reviews by 11%.
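A hedged sketch of the file-authorship heuristic mentioned above, with illustrative data structures and names rather than the production RevRecV1 system:

```python
from collections import Counter


def recommend_reviewers(changed_files, authorship_history, top_k=2):
    """Rank candidate reviewers by how often they previously authored the changed files.

    authorship_history maps file path -> list of past authors (illustrative format).
    """
    counts = Counter()
    for path in changed_files:
        counts.update(authorship_history.get(path, []))
    return [author for author, _ in counts.most_common(top_k)]


history = {"src/app.py": ["alice", "bob", "alice"],
           "src/utils.py": ["carol", "alice"]}
print(recommend_reviewers(["src/app.py", "src/utils.py"], history))
```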
arXiv Detail & Related papers (2023-12-28T17:55:13Z) - OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z) - What Makes a Code Review Useful to OpenDev Developers? An Empirical
Investigation [4.061135251278187]
Even a minor improvement in the effectiveness of code reviews can yield significant savings for a software development organization.
This study aims to develop a finer-grained understanding of what makes a code review comment useful to OSS developers.
arXiv Detail & Related papers (2023-02-22T22:48:27Z) - On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
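A toy illustration of this stress-testing methodology: synthesize an error in a candidate output and check whether the metric's score drops accordingly. The metric below is a trivial token-overlap stand-in, not one of the metrics studied in the paper.

```python
def metric(candidate: str, source: str) -> float:
    """Toy stand-in metric: fraction of source tokens present in the candidate."""
    cand, src = set(candidate.lower().split()), set(source.lower().split())
    return len(cand & src) / len(src) if src else 0.0


source = "the function returns the sum of the two inputs"
good_output = "the function returns the sum of the two inputs"
corrupted_output = good_output.replace("sum", "product")  # synthesized semantic error

drop = metric(good_output, source) - metric(corrupted_output, source)
print(f"score drop after corruption: {drop:.2f}")  # a robust metric should show a clear drop
```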
arXiv Detail & Related papers (2022-12-20T06:24:25Z) - Deep Just-In-Time Inconsistency Detection Between Comments and Source Code [51.00904399653609]
In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code.
We develop a deep-learning approach that learns to correlate a comment with code changes.
We show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system.
arXiv Detail & Related papers (2020-10-04T16:49:28Z) - Automating App Review Response Generation [67.58267006314415]
We propose RRGen, a novel approach that automatically generates review responses by learning knowledge relations between reviews and their responses.
Experiments on 58 apps and 309,246 review-response pairs highlight that RRGen outperforms the baselines by at least 67.4% in terms of BLEU-4.
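A minimal sketch of how a BLEU-4 comparison like the one above can be computed, using placeholder sentences and NLTK's smoothed sentence-level BLEU rather than whatever exact setup the paper uses:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = [["thanks", "for", "the", "feedback", "we", "will", "fix", "this", "soon"]]
candidate = ["thanks", "for", "the", "feedback", "we", "are", "fixing", "it"]

# BLEU-4: equal weights on 1- to 4-gram precisions, with smoothing for short texts.
bleu4 = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```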
arXiv Detail & Related papers (2020-02-10T05:23:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.