Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
- URL: http://arxiv.org/abs/2512.16272v1
- Date: Thu, 18 Dec 2025 07:43:48 GMT
- Title: Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
- Authors: Ora Nova Fandina, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky,
- Abstract summary: Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. LaaJs tend to overlook domain-specific issues, raising concerns about their reliability in critical evaluation tasks. We develop a lightweight analytic checker tool that flags over 30 domain-specific issues observed in practice. We use its outputs as analytic hints, dynamically injecting them into the judge's prompt to encourage the LaaJ to revisit aspects it may have overlooked.
- Score: 2.4484932263697234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models are increasingly deployed as judges (LaaJ) in code generation pipelines. While attractive for scalability, LaaJs tend to overlook domain-specific issues, raising concerns about their reliability in critical evaluation tasks. To better understand these limitations in practice, we examine LaaJ behavior in a concrete industrial use case: legacy code modernization via COBOL code generation. In this setting, we find that even production-deployed LaaJs can miss domain-critical errors, revealing consistent blind spots in their evaluation capabilities. To better understand these blind spots, we analyze generated COBOL programs and the associated LaaJ judgments, drawing on expert knowledge to construct a preliminary taxonomy. Based on this taxonomy, we develop a lightweight analytic checker tool that flags over 30 domain-specific issues observed in practice. We use its outputs as analytic hints, dynamically injecting them into the judge's prompt to encourage the LaaJ to revisit aspects it may have overlooked. Experiments on a test set of 100 programs using four production-level LaaJs show that the LaaJ alone detects only about 45% of the errors present in the code (across all judges we tested), while the analytic checker alone lacks explanatory depth. When combined, the LaaJ+Hints configuration achieves up to 94% coverage (for the best-performing judge and injection prompt) and produces qualitatively richer, more accurate explanations, demonstrating that analytic-LLM hybrids can substantially enhance evaluation reliability in deployed pipelines. We release the dataset and all prompts used.
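To make the LaaJ+Hints setup concrete, the sketch below shows one way checker findings could be formatted and injected into a judge's prompt. The two COBOL rules, function names, and prompt wording are illustrative assumptions for this summary; they do not reproduce the paper's released prompts or its 30+ domain-specific checks.

```python
# Minimal sketch of the LaaJ+Hints idea: a lightweight analytic checker flags
# domain-specific issues in generated COBOL, and its findings are appended to
# the judge's prompt as hints. All rules and identifiers here are illustrative.
import re

def analytic_checker(cobol_source: str) -> list[str]:
    """Return human-readable findings for a few example COBOL pitfalls (assumed rules)."""
    findings = []
    # Example rule: a division without an obvious divide-by-zero guard.
    if re.search(r"COMPUTE\s+\S+\s*=\s*\S+\s*/\s*\S+", cobol_source, re.IGNORECASE):
        findings.append("Division found; check for a divide-by-zero guard (e.g. ON SIZE ERROR).")
    # Example rule: a file is OPENed but never CLOSEd.
    if re.search(r"\bOPEN\b", cobol_source, re.IGNORECASE) and not re.search(r"\bCLOSE\b", cobol_source, re.IGNORECASE):
        findings.append("A file is OPENed but never CLOSEd.")
    return findings

def build_judge_prompt(task: str, cobol_source: str, hints: list[str]) -> str:
    """Compose a judging prompt, injecting checker findings as analytic hints."""
    hint_block = "\n".join(f"- {h}" for h in hints) if hints else "- (no analytic findings)"
    return (
        f"You are reviewing generated COBOL for the task: {task}\n\n"
        f"Program:\n{cobol_source}\n\n"
        "Analytic hints (revisit these aspects before scoring):\n"
        f"{hint_block}\n\n"
        "Report all errors you find and explain each one."
    )

if __name__ == "__main__":
    sample = "OPEN INPUT CUSTOMER-FILE.\nCOMPUTE AVG-BAL = TOTAL-BAL / NUM-ACCTS."
    prompt = build_judge_prompt("compute average balance", sample, analytic_checker(sample))
    print(prompt)  # this prompt would then be sent to the LaaJ for evaluation
```

In this reading of the abstract, the checker supplies coverage (it reliably surfaces known issue patterns) while the judge supplies explanatory depth, which is why the combined configuration can outperform either component alone.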
Related papers
- CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis [7.007981312278749]
We introduce CryptoAnalystBench, an analyst-aligned benchmark of 198 production crypto and DeFi queries spanning 11 categories. We develop a taxonomy of seven higher-order error types that are not reliably captured by factuality checks or LLM-based quality scoring. We find that these failures persist even in state-of-the-art systems and can compromise high-stakes decisions.
arXiv Detail & Related papers (2026-02-11T19:29:31Z) - Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes [2.9195489041890297]
Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review. Without validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data.
arXiv Detail & Related papers (2025-10-31T07:27:54Z) - Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework [10.919459368597295]
We present a systematic evaluation of Large Language Model (LLM) reasoning in test case generation. We evaluate StarCoder and GPT-4o on Defects4J, GHRB, and mutated variants that introduce linguistic and semantic challenges.
arXiv Detail & Related papers (2025-10-06T20:47:12Z) - LaajMeter: A Framework for LaaJ Evaluation [1.8583060903632522]
Large Language Models (LLMs) are increasingly used as evaluators in natural language processing tasks. LaaJMeter is a simulation-based framework for controlled meta-evaluation of LaaJs.
arXiv Detail & Related papers (2025-08-13T19:51:05Z) - ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation [51.297873393639456]
ArtifactsBench is a framework for automated visual code generation evaluation. Our framework renders each generated artifact and captures its dynamic behavior through temporal screenshots. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading Large Language Models.
arXiv Detail & Related papers (2025-07-07T12:53:00Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation [1.7268889851975326]
We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) in test-driven development (TDD) tasks. Unlike traditional approaches relying on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases.
arXiv Detail & Related papers (2025-05-13T23:47:12Z) - Integrating Expert Knowledge into Logical Programs via LLMs [3.637365301757111]
ExKLoP is a framework designed to evaluate how effectively Large Language Models integrate expert knowledge into logical reasoning systems. This capability is especially valuable in engineering, where expert knowledge, such as manufacturer-recommended operational ranges, can be directly embedded into automated monitoring systems.
arXiv Detail & Related papers (2025-02-17T19:18:23Z) - Assessing the Answerability of Queries in Retrieval-Augmented Code Generation [7.68409881755304]
This study proposes a task for evaluating answerability, which assesses whether valid answers can be generated.
We build a benchmark dataset called Retrieval-augmented Code Generability Evaluation (RaCGEval) to evaluate the performance of models performing this task.
arXiv Detail & Related papers (2024-11-08T13:09:14Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests. First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics. Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.