A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era
- URL: http://arxiv.org/abs/2602.13377v1
- Date: Fri, 13 Feb 2026 18:19:38 GMT
- Title: A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era
- Authors: Taufiqul Islam Khan, Shaowei Wang, Haoxiang Zhang, Tse-Hsun Chen,
- Abstract summary: Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing.<n>With the rapid advancement of large language models (LLMs), a growing body of work has explored automated support for code review.<n>Current code review datasets are scattered, vary widely in design, and provide limited insight into what review capabilities are actually being assessed.
- Score: 10.935053388447372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing. With the rapid advancement of large language models (LLMs), a growing body of work has explored automated support for code review. However, progress in this area is hindered by the lack of a systematic understanding of existing benchmarks and evaluation practices. Current code review datasets are scattered, vary widely in design, and provide limited insight into what review capabilities are actually being assessed. In this paper, we present a comprehensive survey of code review benchmarks spanning both the Pre-LLM and LLM eras (2015--2025). We analyze 99 research papers (58 Pre-LLM era and 41 LLM era) and extract key metadata, including datasets, evaluation metrics, data sources, and target tasks. Based on this analysis, we propose a multi-level taxonomy that organizes code review research into five domains and 18 fine-grained tasks. Our study reveals a clear shift toward end-to-end generative peer review, increasing multilingual coverage, and a decline in standalone change understanding tasks. We further identify limitations of current benchmarks and outline future directions, including broader task coverage, dynamic runtime evaluation, and taxonomy-guided fine-grained assessment. This survey provides a structured foundation for developing more realistic and comprehensive benchmarks for LLM-based code review.
Related papers
- Previously on... Automating Code Review [4.096540146408279]
Modern Code Review (MCR) is a standard practice in software engineering, yet it demands substantial time and resource investments.<n>Recent research has increasingly explored automating core review tasks using machine learning (ML) and deep learning (DL)<n>This study provides the first comprehensive analysis of MCR automation research.
arXiv Detail & Related papers (2025-08-25T13:12:48Z) - MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code for the latest code generation LLMs in Russian.<n>This benchmark includes 11 evaluation tasks that span 8 programming languages.<n>We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
This survey probes the core challenges that the rise of Large Language Models poses for evaluation.<n>We identify and analyze two pivotal transitions: (i) from task-specific to capability-based evaluation, which reorganizes benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety.<n>We will dissect this issue, along with the core challenges of the above two transitions, from the perspectives of methods, datasets, evaluators, and metrics.
arXiv Detail & Related papers (2025-04-26T07:48:52Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.<n>In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.<n>This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - A Survey on Evaluating Large Language Models in Code Generation Tasks [30.256255254277914]
This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks.<n>With the rapid growth in demand for automated software development, LLMs have demonstrated significant potential in the field of code generation.
arXiv Detail & Related papers (2024-08-29T12:56:06Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks.<n>We developed a taxonomy of bugs for incorrect codes and analyzed the root cause for common bug types.<n>We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review [4.181146104301203]
Large Language Models (LLMs) have been developed to assist programming tasks including the generation of program code from natural language input.
This paper provides a critical review of the existing work on the testing and evaluation of these tools.
arXiv Detail & Related papers (2024-06-18T14:25:34Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - Automating Patch Set Generation from Code Review Comments Using Large Language Models [2.045040820541428]
We provide code contexts to five popular Large Language Models (LLMs)
We obtain the suggested code-changes (patch sets) derived from real-world code-review comments.
The performance of each model is meticulously assessed by comparing their generated patch sets against the historical data of human-generated patch-sets.
arXiv Detail & Related papers (2024-04-10T02:46:08Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.