Related papers: Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

URL: http://arxiv.org/abs/2505.16222v1
Date: Thu, 22 May 2025 04:49:33 GMT
Title: Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation
Authors: Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung,
Abstract summary: Using large language models as evaluators has expanded to code evaluation tasks.<n>This raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations?<n>We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation.
Score: 14.521056434373213
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: With the growing use of large language models(LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations-such as differences in variable names, comments, or formatting-that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.

Related papers

LLMs on Trial: Evaluating Judicial Fairness for Large Language Models [18.895994052898754]
Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity.<n>LLMs' judicial fairness and implications for social justice remain underexplored.<n>We construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values.<n>Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts.
arXiv Detail & Related papers (2025-07-14T22:56:58Z)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks [63.562924932512765]
Large Language Models (LLMs) have advanced the state-of-the-art in various coding tasks.<n>LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models.
arXiv Detail & Related papers (2025-07-14T17:56:29Z)
Quantitative LLM Judges [48.676042957523045]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to human scores in a given domain.<n>The models are trained to improve the score of the original judge by using the judge's textual evaluation and score.<n>Our experiments show that quantitative judges can effectively improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
JudgeBench: A Benchmark for Evaluating LLM-based Judges [61.048125269475854]
JudgeBench is a benchmark for evaluating LLM-based judges on challenging response pairs spanning knowledge, reasoning, math, and coding.<n>Our comprehensive evaluation on a collection of prompted judges, fine-tuned judges, multi-agent judges, and reward models shows that JudgeBench poses a significantly greater challenge than previous benchmarks.
arXiv Detail & Related papers (2024-10-16T17:58:19Z)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose a new automated bias quantification framework-CALM- which quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates [11.948519516797745]
We develop an open-source framework to evaluate, compare, and visualize the reliability and alignment of LLM judges.<n>Our results indicate a significant impact of prompt templates on LLM judge performance, as well as a mediocre alignment level between the tested LLM judges and human evaluators.
arXiv Detail & Related papers (2024-08-23T11:49:01Z)
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.<n>The question of how reliable these evaluators are has emerged as a crucial research question.<n>We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. Code reasoning is one of the most essential abilities of code LLMs. We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
A Comprehensive Evaluation of Large Language Models on Legal Judgment Prediction [60.70089334782383]
Large language models (LLMs) have demonstrated great potential for domain-specific applications. Recent disputes over GPT-4's law evaluation raise questions concerning their performance in real-world legal tasks. We design practical baseline solutions based on LLMs and test on the task of legal judgment prediction.
arXiv Detail & Related papers (2023-10-18T07:38:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.