Related papers: Predicting Expert Evaluations in Software Code Reviews

URL: http://arxiv.org/abs/2409.15152v1
Date: Mon, 23 Sep 2024 16:01:52 GMT
Title: Predicting Expert Evaluations in Software Code Reviews
Authors: Yegor Denisov-Blanch, Igor Ciobanu, Simon Obstbaum, Michal Kosinski,
Abstract summary: This paper presents an algorithmic model that automates aspects of code review typically avoided due to their complexity or subjectivity. Instead of replacing manual reviews, our model adds insights that help reviewers focus on more impactful tasks.
Score: 8.012861163935904
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Manual code reviews are an essential but time-consuming part of software development, often leading reviewers to prioritize technical issues while skipping valuable assessments. This paper presents an algorithmic model that automates aspects of code review typically avoided due to their complexity or subjectivity, such as assessing coding time, implementation time, and code complexity. Instead of replacing manual reviews, our model adds insights that help reviewers focus on more impactful tasks. Calibrated using expert evaluations, the model predicts key metrics from code commits with strong correlations to human judgments (r = 0.82 for coding time, r = 0.86 for implementation time). By automating these assessments, we reduce the burden on human reviewers and ensure consistent analysis of time-consuming areas, offering a scalable solution alongside manual reviews. This research shows how automated tools can enhance code reviews by addressing overlooked tasks, supporting data-driven decisions and improving the review process.
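
The correlations quoted in the abstract (r = 0.82 and r = 0.86) are presumably Pearson coefficients between model predictions and expert judgments, although the abstract does not name the statistic. As a rough illustration only, the sketch below computes Pearson's r for two lists of estimates. The function name `pearson_r` and the sample values are hypothetical and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: expert-estimated vs. model-predicted coding time (hours)
# for five commits. These numbers are invented for illustration.
expert_hours = [1.0, 2.5, 4.0, 6.0, 8.0]
model_hours = [1.2, 2.0, 4.5, 5.5, 8.5]

r = pearson_r(expert_hours, model_hours)
print(f"r = {r:.2f}")
```

A value near 1 means the predictions rise and fall with the expert estimates. It does not by itself show that the predicted values match the experts' values in absolute terms.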

Related papers

LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews [74.87393214734114]
This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. Instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 points.
arXiv Detail & Related papers (2025-04-15T10:07:33Z)
Identifying Aspects in Peer Reviews [61.374437855024844]
We develop a data-driven schema for deriving fine-grained aspects from a corpus of peer reviews. We introduce a dataset of peer reviews augmented with aspects and show how it can be used for community-level review analysis.
arXiv Detail & Related papers (2025-04-09T14:14:42Z)
Automating Code Review: A Systematic Literature Review [15.416725497289697]
Code review consists of assessing the code written by teammates with the goal of increasing code quality. Empirical studies have documented the benefits of this practice, which nevertheless comes at a cost in developers' time. Researchers have proposed techniques and tools to automate code review tasks.
arXiv Detail & Related papers (2025-03-12T16:19:10Z)
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [97.18215355266143]
We introduce a holistic code critique benchmark for Large Language Models (LLMs) called CodeCriticBench. Specifically, our CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. Besides, the evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics.
arXiv Detail & Related papers (2025-02-23T15:36:43Z)
BitsAI-CR: Automated Code Review via LLM in Practice [16.569842114384233]
BitsAI-CR is an innovative framework that enhances code review through a two-stage approach. The system is built upon a comprehensive taxonomy of review rules and implements a data flywheel mechanism. Empirical evaluation demonstrates BitsAI-CR's effectiveness, achieving 75.0% precision in review comment generation.
arXiv Detail & Related papers (2025-01-25T08:39:50Z)
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models. We first propose AutoMathCritique, an automated and scalable framework for collecting critique data. We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z)
Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? [14.970843824847956]
We run a controlled experiment with 29 experts who reviewed different programs with/without the support of an automatically generated code review. We show that reviewers consider most of the issues automatically identified by the LLM to be valid, and that the availability of an automated review as a starting point strongly influences their behavior. Reviewers who started from an automated review identified more low-severity issues, but no more high-severity issues, than those following a completely manual process.
arXiv Detail & Related papers (2024-11-18T09:24:01Z)
Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework. Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
Leveraging Reviewer Experience in Code Review Comment Generation [11.224317228559038]
We train deep learning models to imitate human reviewers in providing natural language code reviews. The quality of the model-generated reviews remains sub-optimal due to the quality of the open-source code review data used in model training. We propose a suite of experience-aware training methods that utilise the reviewers' past authoring and reviewing experiences as signals for review quality.
arXiv Detail & Related papers (2024-09-17T07:52:50Z)
Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail. The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure. Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
Improving Automated Code Reviews: Learning from Experience [12.573740138977065]
This study investigates whether higher-quality reviews can be generated from automated code review models. We find that experience-aware oversampling can increase the correctness, level of information, and meaningfulness of reviews.
arXiv Detail & Related papers (2024-02-06T07:48:22Z)
Code Review Automation: Strengths and Weaknesses of the State of the Art [14.313783664862923]
This paper characterizes the cases in which three code review automation techniques tend to succeed or fail in two code review tasks. The study has a strong qualitative focus, with 105 man-hours of manual inspection invested in analyzing correct and wrong predictions.
arXiv Detail & Related papers (2024-01-10T13:00:18Z)
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing [139.77117915309023]
CRITIC allows large language models to validate and amend their own outputs in a manner similar to human interaction with tools. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs.
arXiv Detail & Related papers (2023-05-19T15:19:44Z)
Predicting Code Review Completion Time in Modern Code Review [12.696276129130332]
Modern Code Review (MCR) is being adopted in both open source and commercial projects as a common practice. Code reviews can experience significant delays in completion due to various socio-technical factors. There is a lack of tool support to help developers estimate the time required to complete a code review.
arXiv Detail & Related papers (2021-09-30T14:00:56Z)
Deep Just-In-Time Inconsistency Detection Between Comments and Source Code [51.00904399653609]
In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code. We develop a deep-learning approach that learns to correlate a comment with code changes. We show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system.
arXiv Detail & Related papers (2020-10-04T16:49:28Z)
Automating App Review Response Generation [67.58267006314415]
We propose a novel approach, RRGen, that automatically generates review responses by learning knowledge relations between reviews and their responses. Experiments on 58 apps and 309,246 review-response pairs highlight that RRGen outperforms the baselines by at least 67.4% in terms of BLEU-4.
arXiv Detail & Related papers (2020-02-10T05:23:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.