Rethinking Code Review Workflows with LLM Assistance: An Empirical Study
- URL: http://arxiv.org/abs/2505.16339v1
- Date: Thu, 22 May 2025 07:54:07 GMT
- Title: Rethinking Code Review Workflows with LLM Assistance: An Empirical Study
- Authors: Fannar Steinn Aðalsteinsson, Björn Borgar Magnússon, Mislav Milicevic, Adam Nirving Davidsson, Chih-Hong Cheng
- Abstract summary: This paper combines an exploratory field study of current code review practices with a field experiment involving two variations of an LLM-assisted code review tool. The study identifies key challenges in traditional code reviews, including frequent context switching and insufficient contextual information. In the field experiment, we developed two prototype variations: one offering LLM-generated reviews upfront and the other enabling on-demand interaction.
- Score: 2.9593087583214173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code reviews are a critical yet time-consuming aspect of modern software development, increasingly challenged by growing system complexity and the demand for faster delivery. This paper presents a study conducted at WirelessCar Sweden AB, combining an exploratory field study of current code review practices with a field experiment involving two variations of an LLM-assisted code review tool. The field study identifies key challenges in traditional code reviews, including frequent context switching, insufficient contextual information, and highlights both opportunities (e.g., automatic summarization of complex pull requests) and concerns (e.g., false positives and trust issues) in using LLMs. In the field experiment, we developed two prototype variations: one offering LLM-generated reviews upfront and the other enabling on-demand interaction. Both utilize a semantic search pipeline based on retrieval-augmented generation to assemble relevant contextual information for the review, thereby tackling the uncovered challenges. Developers evaluated both variations in real-world settings: AI-led reviews are overall more preferred, while still being conditional on the reviewers' familiarity with the code base, as well as on the severity of the pull request.
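To make the pipeline described in the abstract concrete, here is a minimal, dependency-free sketch of a retrieval-augmented context-assembly step for an LLM-assisted review. It is not the authors' implementation: the bag-of-words embedding, the `ReviewContextIndex` class, and the prompt wording are illustrative assumptions standing in for a real embedding model, vector store, and prompt template.

```python
# Minimal, illustrative sketch of a retrieval-augmented context pipeline for an
# LLM-assisted code review. NOT the paper's implementation: embedding, index,
# and prompt layout are assumptions chosen so the example runs with stdlib only.
import math
from collections import Counter
from dataclasses import dataclass

def embed(text: str) -> Counter:
    # Stand-in for a real code/text embedding model: a bag-of-words vector
    # keeps the example dependency-free while preserving the pipeline shape.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class ContextChunk:
    source: str   # e.g. file path, design doc, or linked ticket
    text: str

class ReviewContextIndex:
    """Semantic index over repository artefacts used to ground the review."""
    def __init__(self, chunks):
        self.chunks = [(c, embed(c.text)) for c in chunks]

    def retrieve(self, query: str, k: int = 3):
        q = embed(query)
        ranked = sorted(self.chunks, key=lambda cv: cosine(q, cv[1]), reverse=True)
        return [c for c, _ in ranked[:k]]

def build_review_prompt(diff: str, index: ReviewContextIndex) -> str:
    # Assemble retrieved context plus the diff into one prompt that either
    # prototype variant (upfront or on-demand) could send to an LLM.
    context = index.retrieve(diff)
    ctx_block = "\n".join(f"[{c.source}] {c.text}" for c in context)
    return (
        "You are reviewing a pull request.\n"
        f"Relevant project context:\n{ctx_block}\n\n"
        f"Diff under review:\n{diff}\n\n"
        "Summarize the change and list potential issues."
    )

if __name__ == "__main__":
    index = ReviewContextIndex([
        ContextChunk("docs/billing.md", "billing service retries failed invoices"),
        ContextChunk("src/auth.py", "token refresh logic for the auth service"),
    ])
    print(build_review_prompt("fix retry backoff for failed invoices", index))
```

In a production variant, `embed` would call an actual embedding model and the index would live in a vector database; the overall shape of the retrieval and prompt-assembly step stays the same.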
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We also introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Code Review as Decision-Making -- Building a Cognitive Model from the Questions Asked During Code Review [2.8299846354183953]
We build a cognitive model of code review bottom-up through thematic, statistical, temporal, and sequential analysis of the transcribed material. The model shows how developers move through two phases during the code review: first an orientation phase to establish context and rationale, then an analytical phase to understand, assess, and plan the rest of the review.
arXiv Detail & Related papers (2025-07-13T14:04:16Z)
- Towards Practical Defect-Focused Automated Code Review [8.370750734081088]
We explore the full automation pipeline within the online recommendation service, analyzing industry-grade C++ code. We identify four key challenges: 1) capturing relevant context, 2) improving key inclusion, 3) reducing false alarm rates (FAR), and 4) integrating human bug slicing. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines.
arXiv Detail & Related papers (2025-05-23T14:06:26Z)
- Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Our contributions focus on three key challenges encountered in real-world use: (i) user prompts are often under-specified; (ii) retrieved candidate papers frequently contain irrelevant content; and (iii) task evaluation should move beyond shallow text-similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z)
- Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics [1.3707925738322797]
We focus on LLM-based code evaluation and attempt to fill the existing gaps. We propose novel multi-agentic approaches using question-specific rubrics tailored to the problem statement. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance the logical assessment of code in educational settings.
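As a rough illustration of rubric-driven LLM grading (not the paper's multi-agent setup), the sketch below passes a hypothetical question-specific rubric to a stubbed model call; the rubric items and the `ask_llm` callable are assumptions made for the example.

```python
# Illustrative sketch only: a question-specific rubric handed to an LLM grader.
# The rubric fields and `ask_llm` are assumptions, not the paper's agents.
from typing import Callable

RUBRIC = [
    ("Handles empty input", 2),
    ("Uses the required O(n log n) algorithm", 4),
    ("Returns indices rather than values", 2),
]

def grade(code: str, rubric, ask_llm: Callable[[str], str]) -> str:
    criteria = "\n".join(f"- {desc} ({pts} pts)" for desc, pts in rubric)
    prompt = (
        "Grade the student code against each rubric item and justify the score.\n"
        f"Rubric:\n{criteria}\n\nCode:\n{code}"
    )
    return ask_llm(prompt)

# Stubbed model so the snippet runs offline:
print(grade("def solve(xs): return sorted(range(len(xs)), key=xs.__getitem__)",
            RUBRIC, ask_llm=lambda p: "(model response would appear here)"))
```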
arXiv Detail & Related papers (2025-03-31T11:59:43Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root causes of common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
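The sketch below shows one way such a critique-and-correct loop could look, using Python's built-in compile() as a stand-in compiler check and a stubbed generator; the loop structure, prompts, and names are assumptions for illustration, not the paper's actual method.

```python
# Minimal sketch of a training-free self-critique loop: check the model output,
# feed compiler feedback back, retry a bounded number of times. The `generate`
# stub stands in for a real LLM call; everything here is an assumption.
from typing import Callable

def self_critique_repair(task: str,
                         generate: Callable[[str], str],
                         max_rounds: int = 3) -> str:
    prompt = task
    code = generate(prompt)
    for _ in range(max_rounds):
        try:
            compile(code, "<candidate>", "exec")   # cheap "compiler" check
            return code                            # accept on clean compile
        except SyntaxError as err:
            # Append the feedback to the next prompt and ask for a fix.
            prompt = (f"{task}\nPrevious attempt:\n{code}\n"
                      f"Compiler feedback: {err}\nPlease fix the syntax bug.")
            code = generate(prompt)
    return code

# Stubbed model: the first answer is broken, the revised one compiles.
answers = iter(["def add(a, b) return a + b", "def add(a, b):\n    return a + b"])
print(self_critique_repair("Write add(a, b).", lambda p: next(answers)))
```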
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- HumanEvalComm: Benchmarking the Communication Competence of Code Generation for LLMs and LLM Agent [2.8391355909797644]
Large language models (LLMs) have significantly improved their ability to perform tasks in the field of code generation. However, there is still a gap between LLMs being capable coders and being top-tier software engineers.
arXiv Detail & Related papers (2024-05-31T22:06:18Z)
- Automating Patch Set Generation from Code Review Comments Using Large Language Models [2.045040820541428]
We provide code contexts to five popular Large Language Models (LLMs) and obtain the suggested code changes (patch sets) derived from real-world code review comments.
The performance of each model is meticulously assessed by comparing its generated patch sets against the historical data of human-generated patch sets.
arXiv Detail & Related papers (2024-04-10T02:46:08Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected, high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation of over 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- Code Reviewer Recommendation Based on a Hypergraph with Multiplex Relationships [30.74556500021384]
We present MIRRec, a novel code reviewer recommendation method that leverages a hypergraph with multiplex relationships.
MIRRec encodes high-order correlations that go beyond traditional pairwise connections using degree-free hyperedges among pull requests and developers.
To validate the effectiveness of MIRRec, we conducted experiments using a dataset comprising 48,374 pull requests from ten popular open-source software projects hosted on GitHub.
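For intuition, the sketch below shows a toy hypergraph over pull requests and developers and a naive reviewer-scoring pass over shared hyperedges; the data and the scoring rule are illustrative assumptions for demonstration only, not MIRRec itself.

```python
# Toy hypergraph illustration: a hyperedge can join any number of pull requests
# and developers at once, unlike a pairwise edge. Data and scoring are assumed.
from collections import defaultdict

# Each hyperedge groups entities that co-occur (e.g. a PR, its author,
# and everyone who reviewed or commented on it).
hyperedges = [
    {"PR#101", "dev:alice", "dev:bob"},
    {"PR#102", "dev:bob", "dev:carol"},
    {"PR#103", "dev:alice", "dev:carol", "dev:dave"},
]

def recommend_reviewers(new_pr_neighbors, top_k=2):
    # Score developers by how many hyperedges they share with entities
    # already linked to the new pull request (author, related PRs, ...).
    scores = defaultdict(int)
    for edge in hyperedges:
        if edge & new_pr_neighbors:
            for node in edge:
                if node.startswith("dev:") and node not in new_pr_neighbors:
                    scores[node] += 1
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(recommend_reviewers({"dev:alice"}))  # toy ranking of candidate reviewers
```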
arXiv Detail & Related papers (2024-01-19T15:25:14Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
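Below is a rough, assumption-based sketch of how search-augmented few-shot prompting of this kind might assemble a prompt; the evidence fields, ordering rule, and demonstration are invented for illustration and do not reproduce FreshPrompt's exact format.

```python
# Rough sketch in the spirit of search-augmented few-shot prompting: retrieved
# evidence snippets are ordered and prepended to the question along with a
# demonstration. Fields, ordering, and demo text are assumptions.
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str
    date: str     # newest evidence is placed closest to the question
    text: str

DEMONSTRATIONS = [
    ("Who is the CEO of ExampleCorp?",
     "Based on the most recent evidence, the CEO is Jane Doe."),
]

def fresh_prompt(question: str, evidence) -> str:
    evidence = sorted(evidence, key=lambda e: e.date)  # oldest first, newest last
    lines = []
    for q, a in DEMONSTRATIONS:
        lines += [f"question: {q}", f"answer: {a}", ""]
    for e in evidence:
        lines.append(f"[{e.date}] ({e.source}) {e.text}")
    lines += [f"question: {question}",
              "answer (reason over the evidence above, prefer the newest):"]
    return "\n".join(lines)

print(fresh_prompt("Who holds the 100m world record?",
                   [Evidence("news", "2009-08-16", "Usain Bolt ran 9.58s in Berlin.")]))
```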
arXiv Detail & Related papers (2023-10-05T00:04:12Z)