CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
- URL: http://arxiv.org/abs/2509.14856v3
- Date: Thu, 23 Oct 2025 05:44:56 GMT
- Title: CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects
- Authors: Hanyang Guo, Xunjin Zheng, Zihan Liao, Hang Yu, Peng DI, Ziyin Zhang, Hong-Ning Dai
- Abstract summary: We introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains. We present the first large-scale assessment of state-of-the-art Large Language Models (LLMs) on this comprehensive CR task.
- Score: 23.9752442213364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated code review (CR) is a key application for Large Language Models (LLMs), but progress is hampered by a "reality gap": existing benchmarks evaluate models on isolated sub-tasks using simplified, context-poor data. This fails to reflect the holistic context-rich nature of real-world CR. To bridge this gap, we introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains, where each instance provides rich, multi-faceted context including the associated issue, PR details, and repository state, enabling end-to-end evaluation. Beyond superficial metrics, we also propose a novel evaluation framework that combines rule-based checks for location and syntax with model-based judgments of review quality. We present the first large-scale assessment of state-of-the-art LLMs on this comprehensive CR task. Our results establish crucial baselines and reveal that (1) no single LLM dominates all aspects of CR; (2) Gemini 2.5 Pro achieves the highest comprehensive performance; and (3) different LLMs exhibit varying robustness to redundant context. These findings highlight the necessity of holistic, multi-dimensional evaluation and provide actionable insights for advancing truly intelligent yet practical CR assistants.
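To make the hybrid evaluation concrete, the sketch below pairs a rule-based gate (is the comment anchored inside the diff, and does any suggested replacement parse as valid Python?) with a model-based quality judgment. This is a minimal illustration under stated assumptions: `judge_quality` is a hypothetical stand-in for an LLM judge, and the gating scheme is not the benchmark's published scoring formula.

```python
import ast

def rule_based_gate(comment: dict, diff_lines: set) -> bool:
    """Location check: the review comment must target a changed line.
    Syntax check: any suggested replacement must parse as valid Python."""
    if comment["line"] not in diff_lines:
        return False
    suggestion = comment.get("suggested_code")
    if suggestion is not None:
        try:
            ast.parse(suggestion)
        except SyntaxError:
            return False
    return True

def judge_quality(comment: dict) -> float:
    """Hypothetical LLM-judge call returning a quality score in [0, 1].
    A real harness would prompt a judge model with the PR and issue context."""
    return 0.8  # stub value for illustration only

def review_score(comment: dict, diff_lines: set) -> float:
    """Hybrid score: the rule-based checks gate the model-based judgment."""
    return judge_quality(comment) if rule_based_gate(comment, diff_lines) else 0.0

# A comment on changed line 42 with a syntactically valid suggestion.
comment = {"line": 42, "suggested_code": "result = value if value else default"}
print(review_score(comment, diff_lines={40, 41, 42, 43}))  # 0.8
```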
Related papers
- AACR-Bench: Evaluating Automatic Code Review with Holistic Repository-Level Context [10.769682566098695]
AACR-Bench is a comprehensive benchmark that provides full cross-file context across multiple programming languages. Unlike traditional datasets, AACR-Bench employs an "AI-assisted, Expert-verified" annotation pipeline to uncover latent defects.
arXiv Detail & Related papers (2026-01-27T11:28:44Z) - Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice [18.222990693059756]
ContextCRBench is a benchmark for fine-grained LLM evaluation in code review. It collects 153.7K issues and pull requests from top-tier repositories. It supports three evaluation scenarios aligned with the review workflow.
arXiv Detail & Related papers (2025-11-10T12:06:35Z) - RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering. RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z) - MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval [86.35779264575154]
Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. We introduce MR$^2$-Bench, a reasoning-intensive benchmark for multimodal retrieval.
arXiv Detail & Related papers (2025-09-30T15:09:14Z) - Benchmarking and Studying the LLM-based Code Review [34.93646390349726]
Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. We introduce SWRBench, a new benchmark offering PR-centric review with full project context. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach.
arXiv Detail & Related papers (2025-09-01T14:13:34Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z) - CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents [6.359764486371197]
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. We present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess key dimensions of practical RAG holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents.
arXiv Detail & Related papers (2025-05-23T05:56:25Z) - CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation [19.071855537400463]
Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. CoCo-Bench is designed to evaluate LLMs across four critical dimensions: code understanding, code generation, code modification, and code review.
arXiv Detail & Related papers (2025-04-29T11:57:23Z) - CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [97.18215355266143]
We introduce a holistic code critique benchmark for Large Language Models (LLMs) called CodeCriticBench. Specifically, CodeCriticBench includes two mainstream code tasks (i.e., code generation and code QA) with different difficulties. In addition, its evaluation protocols include basic critique evaluation and advanced critique evaluation for different characteristics.
arXiv Detail & Related papers (2025-02-23T15:36:43Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
It is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - Beyond Correctness: Benchmarking Multi-dimensional Code Generation for Large Language Models [43.56644186785491]
This paper proposes the RACE benchmark, which comprehensively evaluates the quality of code generated by large language models.
We analyze 28 representative LLMs based on RACE and find that current correctness-centric benchmarks fail to capture the multifaceted requirements of code in real-world scenarios.
arXiv Detail & Related papers (2024-07-16T08:08:48Z) - DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of texts generated by Large Language Models (LLMs).
We introduce an automatic metric converter (AMC) that translates the output of the divide-conquer evaluator (DCE) into an interpretable numeric score (sketched after this list).
Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z)
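For the DCR-Consistency entry above, the abstract describes an automatic metric converter (AMC) that turns the evaluator's verdicts into an interpretable numeric score. Below is a minimal sketch of that idea, assuming per-sentence binary consistency verdicts and a simple fraction-based score (an assumption for illustration, not the paper's exact formula).

```python
def amc_score(verdicts):
    """Convert per-sentence consistency verdicts (True = consistent with the
    source) into a normalized score in [0, 1]: the consistent fraction."""
    if not verdicts:
        return 1.0  # nothing generated, so nothing contradicts the source
    return sum(verdicts) / len(verdicts)

# Example: 3 of 4 generated sentences judged consistent by the evaluator.
print(amc_score([True, True, False, True]))  # 0.75
```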