On The Impact of Merge Request Deviations on Code Review Practices
- URL: http://arxiv.org/abs/2506.08860v2
- Date: Wed, 11 Jun 2025 01:21:02 GMT
- Title: On The Impact of Merge Request Deviations on Code Review Practices
- Authors: Samah Kansab, Francis Bordeleau, Ali Tizghadam
- Abstract summary: We term these cases deviations and hypothesize that ignoring them biases analytics and undermines ML models for review analysis. We identify seven deviation categories, occurring in 37.02% of MRs, and propose a few-shot learning detection method (91% accuracy). Our contributions include: (1) a taxonomy of MR deviations, (2) an AI-driven detection approach, and (3) empirical evidence of their impact on ML-based review analytics.
- Score: 2.8402080392117757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code review is a key practice in software engineering, ensuring quality and collaboration. However, industrial Merge Request (MR) workflows often deviate from standardized review processes, with many MRs serving non-review purposes (e.g., drafts, rebases, or dependency updates). We term these cases deviations and hypothesize that ignoring them biases analytics and undermines ML models for review analysis. We identify seven deviation categories, occurring in 37.02% of MRs, and propose a few-shot learning detection method (91% accuracy). By excluding deviations, ML models predicting review completion time improve performance in 53.33% of cases (up to 2.25x) and exhibit significant shifts in feature importance (47% overall, 60% top-*k*). Our contributions include: (1) a taxonomy of MR deviations, (2) an AI-driven detection approach, and (3) empirical evidence of their impact on ML-based review analytics. This work aids practitioners in optimizing review efforts and ensuring reliable insights.
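The filtering idea in the abstract is straightforward to picture in code. Below is a minimal, hypothetical sketch: a few labeled examples per deviation category act as a few-shot support set, incoming MR titles are matched to the nearest category centroid in sentence-embedding space, and matches are excluded before training a review-completion-time model. The encoder name, example categories, titles, and threshold are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of few-shot MR deviation filtering (not the paper's detector).
import numpy as np
from sentence_transformers import SentenceTransformer

# Few-shot support set: a handful of labeled titles per deviation category.
# Three of the paper's seven categories are mocked up here for illustration.
SUPPORT = {
    "draft":             ["WIP: do not review yet", "Draft: exploring an approach"],
    "rebase":            ["Rebase onto main", "Rebased to resolve merge conflicts"],
    "dependency_update": ["Bump lodash from 4.17.20 to 4.17.21",
                          "chore(deps): update docker base image"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose encoder

# One centroid embedding per category, averaged over its support examples.
centroids = {cat: model.encode(ex).mean(axis=0) for cat, ex in SUPPORT.items()}

def detect_deviation(mr_title, threshold=0.6):
    """Return the closest deviation category, or None if the MR looks reviewable."""
    v = model.encode([mr_title])[0]
    best_cat, best_sim = None, -1.0
    for cat, c in centroids.items():
        sim = float(np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c)))
        if sim > best_sim:
            best_cat, best_sim = cat, sim
    return best_cat if best_sim >= threshold else None

# Filtering step: keep only non-deviation MRs for review-time modeling.
mrs = [
    {"title": "WIP: do not review yet", "review_hours": 0.1},
    {"title": "Fix race condition in job scheduler", "review_hours": 9.5},
]
clean = [m for m in mrs if detect_deviation(m["title"]) is None]
print(f"kept {len(clean)}/{len(mrs)} MRs for training")
```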
Related papers
- Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis [54.361900378970134]
Just-in-time defect prediction (JIT-DP) aims to predict the likelihood of code changes resulting in software defects at an early stage. Prior research has largely ignored refactoring code during both the evaluation and methodology phases, despite its prevalence. We propose Code chAnge Tactics (CAT) analysis to categorize refactoring code and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%.
arXiv Detail & Related papers (2025-07-25T23:29:25Z) - Learning Software Bug Reports: A Systematic Literature Review [4.019641745947759]
Machine learning (ML) aims to automate the understanding, extraction, and correlation of information from bug reports. Despite its growing importance, there has been no comprehensive review in this area. In this paper, we present a systematic literature review covering 1,825 papers, selecting 204 for detailed analysis.
arXiv Detail & Related papers (2025-07-06T15:17:59Z) - Meta-Fair: AI-Assisted Fairness Testing of Large Language Models [2.9632404823837777]
Fairness is a core principle in the development of Artificial Intelligence (AI) systems. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministic heuristics, and curated datasets. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs.
arXiv Detail & Related papers (2025-07-03T11:20:59Z) - Enhancement Report Approval Prediction: A Comparative Study of Large Language Models [10.243182983724585]
Enhancement reports (ERs) serve as a critical communication channel between users and developers, capturing valuable suggestions for software improvement. To address this challenge, enhancement report approval prediction (ERAP) has emerged as a research focus. Recent advancements in large language models (LLMs) present new opportunities for enhancing prediction accuracy.
arXiv Detail & Related papers (2025-06-18T03:08:04Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs). We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z) - Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model (LLM) evaluator preferences with human evaluations. We employed Bayesian statistics and a t-test to quantify this token count bias and developed a recalibration procedure to adjust the GPTScorer; a minimal sketch of this quantify-then-recalibrate idea appears after this list. Our findings show significantly improved alignment between the recalibrated LLM evaluator and human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z) - Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue in modern language models (LMs). Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark, MR-Ben, that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - A First Look at Fairness of Machine Learning Based Code Reviewer Recommendation [14.50773969815661]
This paper conducts the first study toward investigating the issue of fairness of ML applications in the software engineering (SE) domain.
Our empirical study demonstrates that current state-of-the-art ML-based code reviewer recommendation techniques exhibit unfairness and discriminating behaviors.
This paper also discusses the reasons why the studied ML-based code reviewer recommendation systems are unfair and provides solutions to mitigate the unfairness.
arXiv Detail & Related papers (2023-07-21T01:57:51Z)
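For the token count bias entry above (Aligning Model Evaluations with Human Preferences), the quantify-then-recalibrate idea can be sketched as follows. This is not the paper's GPTScorer pipeline or its Bayesian treatment; the simulated scores and the linear correction are assumptions made for illustration.

```python
# Hypothetical sketch: detect token-count bias with a t-test, then recalibrate.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated evaluator scores in which longer responses get inflated scores.
token_counts = rng.integers(20, 500, size=200)
true_quality = rng.normal(0.5, 0.1, size=200)
llm_scores = true_quality + 0.0005 * token_counts + rng.normal(0, 0.02, size=200)

# Step 1: quantify the bias. Split at the median length; run Welch's t-test.
long_mask = token_counts > np.median(token_counts)
t_stat, p_value = stats.ttest_ind(llm_scores[long_mask], llm_scores[~long_mask],
                                  equal_var=False)
print(f"before: t = {t_stat:.2f}, p = {p_value:.2g}")  # low p => length bias

# Step 2: recalibrate by removing the fitted linear dependence on token count.
slope, _ = np.polyfit(token_counts, llm_scores, deg=1)
recalibrated = llm_scores - slope * (token_counts - token_counts.mean())

# The corrected scores should no longer show a significant length effect.
t2, p2 = stats.ttest_ind(recalibrated[long_mask], recalibrated[~long_mask],
                         equal_var=False)
print(f"after:  t = {t2:.2f}, p = {p2:.2g}")
```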
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.