Repeated Builds During Code Review: An Empirical Study of the OpenStack
Community
- URL: http://arxiv.org/abs/2308.10078v1
- Date: Sat, 19 Aug 2023 17:45:03 GMT
- Title: Repeated Builds During Code Review: An Empirical Study of the OpenStack
Community
- Authors: Rungroj Maipradit, Dong Wang, Patanamon Thongtanunam, Raula Gaikovina
Kula, Yasutaka Kamei, Shane McIntosh
- Abstract summary: We conduct an empirical study of 66,932 code reviews from the OpenStack community.
We observe that (i) 55% of code reviews invoke the recheck command after a failing build is reported; (ii) invoking the recheck command only changes the outcome of a failing build in 42% of the cases; and (iii) invoking the recheck command increases review waiting time by an average of 2,200%.
- Score: 11.289146650622662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code review is a popular practice where developers critique each other's
changes. Since automated builds can identify low-level issues (e.g., syntactic
errors, regression bugs), it is not uncommon for software organizations to
incorporate automated builds in the code review process. In such code review
deployment scenarios, submitted change sets must be approved for integration by
both peer code reviewers and automated build bots. Since automated builds may
produce an unreliable signal of the status of a change set (e.g., due to
"flaky" or non-deterministic execution behaviour), code review tools, such as
Gerrit, allow developers to request a "recheck", which repeats the build
process without updating the change set. We conjecture that an unconstrained
recheck command will waste time and resources if it is not applied judiciously.
To explore how the recheck command is applied in a practical setting, in this
paper, we conduct an empirical study of 66,932 code reviews from the OpenStack
community.
We quantitatively analyze (i) how often build failures are rechecked; (ii)
the extent to which invoking recheck changes build failure outcomes; and (iii)
how much waste is generated by invoking recheck. We observe that (i) 55% of
code reviews invoke the recheck command after a failing build is reported; (ii)
invoking the recheck command only changes the outcome of a failing build in 42%
of the cases; and (iii) invoking the recheck command increases review waiting
time by an average of 2,200% and equates to 187.4 compute years of waste --
enough compute resources to compete with the oldest land living animal on
earth.
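The study's core measurements amount to event mining over Gerrit review histories: find a failing build, check whether a "recheck" comment follows it, and compare the next build result reported for the same change set. Below is a minimal sketch (not the authors' pipeline) of how comparable statistics could be computed from review events exported to JSON; the field names (events, type, verdict, ts, text) and the reviews.json file are illustrative assumptions rather than OpenStack's actual data schema, and counts are per failing build rather than per review.

```python
"""
Minimal sketch (illustrative only): given Gerrit review events exported to
JSON, estimate (i) how often failing builds are followed by a "recheck"
comment, (ii) how often the rerun flips the outcome to SUCCESS, and
(iii) how much extra waiting time the reruns add. The schema below
("events", "type", "verdict", "ts", "text") is an assumption, not the
format used by OpenStack or in the paper.
"""
import json
import re
from datetime import datetime

# Gerrit/Zuul rechecks are requested by commenting "recheck" on the change.
RECHECK_RE = re.compile(r"^\s*recheck\b", re.IGNORECASE | re.MULTILINE)


def parse_ts(ts: str) -> datetime:
    # Assumes ISO-8601 timestamps, e.g. "2023-08-19T17:45:03".
    return datetime.fromisoformat(ts)


def analyze(reviews):
    failing = rechecked = flipped = 0
    extra_wait_s = 0.0
    for review in reviews:
        events = sorted(review["events"], key=lambda e: parse_ts(e["ts"]))
        for i, ev in enumerate(events):
            if ev["type"] != "build" or ev["verdict"] != "FAILURE":
                continue
            failing += 1
            later = events[i + 1:]
            # First "recheck" comment posted after this failure, if any.
            recheck = next(
                (e for e in later
                 if e["type"] == "comment" and RECHECK_RE.search(e["text"])),
                None,
            )
            if recheck is None:
                continue
            rechecked += 1
            # First build result reported after the recheck comment.
            rebuild = next(
                (e for e in later
                 if e["type"] == "build"
                 and parse_ts(e["ts"]) > parse_ts(recheck["ts"])),
                None,
            )
            if rebuild is None:
                continue
            if rebuild["verdict"] == "SUCCESS":
                flipped += 1
            extra_wait_s += (
                parse_ts(rebuild["ts"]) - parse_ts(ev["ts"])
            ).total_seconds()
    return {
        "failing_builds": failing,
        "recheck_rate": rechecked / failing if failing else 0.0,
        "flip_rate": flipped / rechecked if rechecked else 0.0,
        "extra_wait_hours": extra_wait_s / 3600.0,
    }


if __name__ == "__main__":
    with open("reviews.json") as fh:  # hypothetical export file
        print(analyze(json.load(fh)))
```

For a sense of scale, the reported 2,200% average increase in waiting time means that, after a recheck, the wait for a build verdict is roughly 23 times longer than the original wait.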
Related papers
- CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification [71.34070740261072]
This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases.
The benchmark is containerized for code execution across tasks, and we will release the code, data, and construction methodologies.
arXiv Detail & Related papers (2025-02-12T21:42:56Z)
- Automated Code Review In Practice [1.6271516689052665]
Several AI-assisted tools, such as Qodo, GitHub Copilot, and Coderabbit, provide automated reviews using large language models (LLMs).
This study examines the impact of LLM-based automated code review tools in an industrial setting.
arXiv Detail & Related papers (2024-12-24T16:24:45Z)
- Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? [14.970843824847956]
We run a controlled experiment with 29 experts who reviewed different programs with/without the support of an automatically generated code review.
We show that reviewers consider valid most of the issues automatically identified by the LLM and that the availability of an automated review as a starting point strongly influences their behavior.
The reviewers who started from an automated review identified a higher number of low-severity issues, but did not identify more high-severity issues than a completely manual process.
arXiv Detail & Related papers (2024-11-18T09:24:01Z)
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells [15.66562304661042]
CRScore is a reference-free metric to measure dimensions of review quality like conciseness, comprehensiveness, and relevance.
We demonstrate that CRScore can produce valid, fine-grained scores of review quality that have the greatest alignment with human judgment among open source metrics.
We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
arXiv Detail & Related papers (2024-09-29T21:53:18Z)
- CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.
We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.
We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z)
- Leveraging Large Language Models for Efficient Failure Analysis in Game Development [47.618236610219554]
This paper proposes a new approach to automatically identify which change in the code caused a test to fail.
The method leverages Large Language Models (LLMs) to associate error messages with the corresponding code changes causing the failure.
Our approach reaches an accuracy of 71% in our newly created dataset, which comprises issues reported by developers at EA over a period of one year.
arXiv Detail & Related papers (2024-06-11T09:21:50Z)
- Code Review Automation: Strengths and Weaknesses of the State of the Art [14.313783664862923]
This paper examines when three code review automation techniques tend to succeed or fail in the two tasks it studies.
The study has a strong qualitative focus, with 105 man-hours of manual inspection invested in analyzing correct and wrong predictions.
arXiv Detail & Related papers (2024-01-10T13:00:18Z)
- Improving Code Reviewer Recommendation: Accuracy, Latency, Workload, and Bystanders [6.538051328482194]
We build upon RevRecV1, the recommender that has been in production since 2018.
We find that reviewers were being assigned based on prior authorship of files.
Having an individual who is responsible for the review reduces the time taken for reviews by 11%.
arXiv Detail & Related papers (2023-12-28T17:55:13Z)
- Predicting Code Review Completion Time in Modern Code Review [12.696276129130332]
Modern Code Review (MCR) is being adopted in both open source and commercial projects as a common practice.
Code reviews can take significantly longer to complete due to various socio-technical factors.
There is a lack of tool support to help developers estimate the time required to complete a code review.
arXiv Detail & Related papers (2021-09-30T14:00:56Z)
- Deep Just-In-Time Inconsistency Detection Between Comments and Source Code [51.00904399653609]
In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code.
We develop a deep-learning approach that learns to correlate a comment with code changes.
We show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system.
arXiv Detail & Related papers (2020-10-04T16:49:28Z)
- Automating App Review Response Generation [67.58267006314415]
We propose a novel approach RRGen that automatically generates review responses by learning knowledge relations between reviews and their responses.
Experiments on 58 apps and 309,246 review-response pairs highlight that RRGen outperforms the baselines by at least 67.4% in terms of BLEU-4.
arXiv Detail & Related papers (2020-02-10T05:23:38Z)