LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
- URL: http://arxiv.org/abs/2510.24367v1
- Date: Tue, 28 Oct 2025 12:44:54 GMT
- Title: LLM-as-a-Judge for Software Engineering: Literature Review, Vision, and the Road Ahead
- Authors: Junda He, Jieke Shi, Terry Yue Zhuo, Christoph Treude, Jiamou Sun, Zhenchang Xing, Xiaoning Du, David Lo,
- Abstract summary: This paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating software artifacts.<n>We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030.
- Score: 27.124885915455426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid integration of Large Language Models (LLMs) into software engineering (SE) has revolutionized tasks like code generation, producing a massive volume of software artifacts. This surge has exposed a critical bottleneck: the lack of scalable, reliable methods to evaluate these outputs. Human evaluation is costly and time-consuming, while traditional automated metrics like BLEU fail to capture nuanced quality aspects. In response, the LLM-as-a-Judge paradigm - using LLMs for automated evaluation - has emerged. This approach leverages the advanced reasoning of LLMs, offering a path toward human-like nuance at automated scale. However, LLM-as-a-Judge research in SE is still in its early stages. This forward-looking SE 2030 paper aims to steer the community toward advancing LLM-as-a-Judge for evaluating LLM-generated software artifacts. We provide a literature review of existing SE studies, analyze their limitations, identify key research gaps, and outline a detailed roadmap. We envision these frameworks as reliable, robust, and scalable human surrogates capable of consistent, multi-faceted artifact evaluation by 2030. Our work aims to foster research and adoption of LLM-as-a-Judge frameworks, ultimately improving the scalability of software artifact evaluation.
Related papers
- LLMs' Reshaping of People, Processes, Products, and Society in Software Development: A Comprehensive Exploration with Early Adopters [3.4069804433026314]
Large language models (LLMs) like OpenAI ChatGPT, Google Gemini, and GitHub Copilot are rapidly gaining traction in the software industry.<n>Our study provides a nuanced understanding of how LLMs are shaping the landscape of software development.
arXiv Detail & Related papers (2025-03-06T22:27:05Z) - From Code to Courtroom: LLMs as the New Software Judges [29.77858458399232]
Large Language Models (LLMs) have been increasingly used to automate software engineering tasks such as code generation and summarization.<n>Human evaluation, while effective, is very costly and time-consuming.<n>The LLM-as-a-Judge paradigm, which employs LLMs for automated evaluation, has emerged.
arXiv Detail & Related papers (2025-03-04T03:48:23Z) - Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering [18.766132076075365]
Large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks like code generation.<n>Pass@k metric requires extensive unit tests and configured environments, and is not suitable for evaluating LLM-generated text.<n>Conventional metrics like BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny.
arXiv Detail & Related papers (2025-02-10T06:49:29Z) - From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future [15.568939568441317]
We investigate the current practice and solutions for large language models (LLMs) and LLM-based agents for software engineering.<n>In particular we summarise six key topics: requirement engineering, code generation, autonomous decision-making, software design, test generation, and software maintenance.<n>We discuss the models and benchmarks used, providing a comprehensive analysis of their applications and effectiveness in software engineering.
arXiv Detail & Related papers (2024-08-05T14:01:15Z) - DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
arXiv Detail & Related papers (2024-02-26T07:33:05Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Towards an Understanding of Large Language Models in Software Engineering Tasks [29.30433406449331]
Large Language Models (LLMs) have drawn widespread attention and research due to their astounding performance in text generation and reasoning tasks.<n>The evaluation and optimization of LLMs in software engineering tasks, such as code generation, have become a research focus.<n>This paper comprehensively investigate and collate the research and products combining LLMs with software engineering.
arXiv Detail & Related papers (2023-08-22T12:37:29Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.