Benchmarking and Studying the LLM-based Code Review
- URL: http://arxiv.org/abs/2509.01494v1
- Date: Mon, 01 Sep 2025 14:13:34 GMT
- Title: Benchmarking and Studying the LLM-based Code Review
- Authors: Zhengran Zeng, Ruikai Shi, Keke Han, Yixin Li, Kaicheng Sun, Yidong Wang, Zhuohao Yu, Rui Xie, Wei Ye, Shikun Zhang
- Abstract summary: Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. We introduce SWRBench, a new benchmark offering PR-centric review with full project context. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach.
- Score: 34.93646390349726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexities, hindering the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. To address these limitations, we introduce SWRBench, a new benchmark comprising 1000 manually verified Pull Requests (PRs) from GitHub, offering PR-centric review with full project context. SWRBench employs an objective LLM-based evaluation method that aligns strongly with human judgment (~90% agreement) by verifying whether issues from a structured ground truth are covered in generated reviews. Our systematic evaluation of mainstream ACR tools and LLMs on SWRBench reveals that current systems underperform and that ACR tools are more adept at detecting functional errors. Subsequently, we propose and validate a simple multi-review aggregation strategy that significantly boosts ACR performance, increasing F1 scores by up to 43.67%. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach, offering valuable insights for advancing ACR research.
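To make the abstract's evaluation protocol and aggregation idea concrete, below is a minimal, hedged Python sketch. It assumes a coverage-style F1 (recall over structured ground-truth issues covered by a review, precision over the issues the review raises) and a naive union of several sampled reviews; a keyword check stands in for the paper's LLM-based judge. All class and function names here are hypothetical and are not taken from the SWRBench code.

```python
# Illustrative sketch only -- not the SWRBench implementation. Issue,
# score_review, aggregate_reviews, and keyword_judge are hypothetical names,
# and the keyword check merely stands in for an LLM-based coverage judge.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Issue:
    """One structured ground-truth issue distilled from a PR's human review."""
    description: str


def score_review(review: str,
                 ground_truth: List[Issue],
                 raised_issue_count: int,
                 covers: Callable[[str, Issue], bool]) -> float:
    """Coverage-style F1: recall over ground-truth issues the review covers,
    precision over the issues the review raises (assumed definitions)."""
    covered = sum(1 for issue in ground_truth if covers(review, issue))
    recall = covered / len(ground_truth) if ground_truth else 0.0
    precision = covered / raised_issue_count if raised_issue_count else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def aggregate_reviews(reviews: List[str]) -> str:
    """Naive multi-review aggregation: merge independently sampled reviews.
    The paper's exact merging/deduplication strategy is not reproduced here."""
    return "\n".join(reviews)


def keyword_judge(review: str, issue: Issue) -> bool:
    """Toy stand-in for an LLM judge that checks whether an issue is covered."""
    return issue.description.lower() in review.lower()


if __name__ == "__main__":
    ground_truth = [Issue("missing null check"), Issue("race condition on cache")]
    samples = ["The patch has a missing null check in parse().",
               "There may be a race condition on cache invalidation."]
    merged = aggregate_reviews(samples)
    # raised_issue_count=2 assumes the aggregated review raises two distinct issues.
    print(score_review(merged, ground_truth, raised_issue_count=2, covers=keyword_judge))
```

In this toy run, aggregating the two sampled reviews covers both ground-truth issues and yields F1 = 1.0, which illustrates (under the stated assumptions) why merging multiple reviews can lift coverage-based scores.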
Related papers
- AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context [10.769682566098695]
AACR-Bench is a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects.
arXiv Detail & Related papers (2026-01-27T11:28:44Z) - Reviewing the Reviewer: Elevating Peer Review Quality through LLM-Guided Feedback [75.31379834079648]
We introduce an LLM-driven framework that decomposes reviews into argumentative segments. We also release LazyReviewPlus, a dataset of 1,309 sentences labeled for lazy thinking and specificity.
arXiv Detail & Related papers (2026-01-17T20:32:18Z) - CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects [23.9752442213364]
We introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains. We present the first large-scale assessment of state-of-the-art Large Language Models (LLMs) on this comprehensive CR task.
arXiv Detail & Related papers (2025-09-18T11:24:09Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - How Much Do Large Language Models Cheat on Evaluation? Benchmarking Overestimation under the One-Time-Pad-Based Framework [8.76693832650115]
Overestimation in evaluating large language models (LLMs) has become an increasing concern. We propose ArxivRoll, a dynamic evaluation framework inspired by one-time pad encryption in cryptography.
arXiv Detail & Related papers (2025-07-25T12:39:03Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, enabling compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z) - LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z) - BitsAI-CR: Automated Code Review via LLM in Practice [16.569842114384233]
BitsAI-CR is an innovative framework that enhances code review through a two-stage approach. The system is built upon a comprehensive taxonomy of review rules and implements a data flywheel mechanism. Empirical evaluation demonstrates BitsAI-CR's effectiveness, achieving 75.0% precision in review comment generation.
arXiv Detail & Related papers (2025-01-25T08:39:50Z) - RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs). Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z) - DeepCRCEval: Revisiting the Evaluation of Code Review Comment Generation [11.010557279355885]
This study empirically analyzes benchmark comments using a novel set of criteria informed by prior research and developer interviews. Our evaluation framework, DeepCRCEval, integrates human evaluators and Large Language Models (LLMs) for a comprehensive reassessment of current techniques.
arXiv Detail & Related papers (2024-12-24T08:53:54Z) - The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum. We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z) - Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [52.76508734756661]
Auto-PRE is an automatic evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluators based on three core traits. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-10-16T06:06:06Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)