Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
- URL: http://arxiv.org/abs/2510.27244v1
- Date: Fri, 31 Oct 2025 07:27:54 GMT
- Title: Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
- Authors: Ora Nova Fandina, Gal Amram, Eitan Farchi, Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Rami Katan, Alice Podolsky, Orna Raz,
- Abstract summary: Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review. Without validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data.
- Score: 2.9195489041890297
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Application modernization in legacy languages such as COBOL, PL/I, and REXX faces an acute shortage of resources, both in expert availability and in high-quality human evaluation data. While Large Language Models as a Judge (LaaJ) offer a scalable alternative to expert review, their reliability must be validated before being trusted in high-stakes workflows. Without principled validation, organizations risk a circular evaluation loop, where unverified LaaJs are used to assess model outputs, potentially reinforcing unreliable judgments and compromising downstream deployment decisions. Although various automated approaches to validating LaaJs have been proposed, alignment with human judgment remains a widely used and conceptually grounded validation strategy. In many real-world domains, the availability of human-labeled evaluation data is severely limited, making it difficult to assess how well a LaaJ aligns with human judgment. We introduce SparseAlign, a formal framework for assessing LaaJ alignment with sparse human-labeled data. SparseAlign combines a novel pairwise-confidence concept with a score-sensitive alignment metric that jointly capture ranking consistency and score proximity, enabling reliable evaluator selection even when traditional statistical methods are ineffective due to limited annotated examples. SparseAlign was applied internally to select LaaJs for COBOL code explanation. The top-aligned evaluators were integrated into assessment workflows, guiding model release decisions. We present a case study of four LaaJs to demonstrate SparseAlign's utility in real-world evaluation scenarios.
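The abstract does not spell out the metric's formulas, so the following is only a minimal sketch of the idea under stated assumptions: weight each pair of human-labeled items by how confidently the human scores separate them, reward the judge for ranking the pair the same way, and discount by how far the judge's scores drift from the human scores. The function name `sparse_align_score`, the linear confidence weight, and the 1-5 scale are illustrative assumptions, not the paper's definitions.

```python
# Minimal sketch of a SparseAlign-style alignment check (illustrative only).
# The pairwise-confidence and score-proximity terms below are assumptions,
# not the formulas from the paper.
from itertools import combinations

def sparse_align_score(human, judge, score_range=(1.0, 5.0)):
    """Combine ranking consistency with score proximity over item pairs.

    human, judge: dicts mapping item id -> score on the same scale.
    Returns a value in [0, 1]; higher means better alignment.
    """
    lo, hi = score_range
    span = hi - lo
    items = sorted(set(human) & set(judge))
    total_weight, agreement = 0.0, 0.0
    for a, b in combinations(items, 2):
        gap = human[a] - human[b]
        confidence = abs(gap) / span          # assumed: larger human gap -> more confident pair
        if confidence == 0:
            continue                          # tied human scores carry no ranking signal
        concordant = (gap * (judge[a] - judge[b])) > 0
        proximity = 1.0 - (abs(human[a] - judge[a]) + abs(human[b] - judge[b])) / (2 * span)
        agreement += confidence * (1.0 if concordant else 0.0) * proximity
        total_weight += confidence
    return agreement / total_weight if total_weight else 0.0

# Sparse human labels: only five annotated explanations.
human = {"e1": 5, "e2": 2, "e3": 4, "e4": 1, "e5": 3}
judge_a = {"e1": 5, "e2": 1, "e3": 4, "e4": 2, "e5": 3}   # candidate LaaJ A
judge_b = {"e1": 3, "e2": 3, "e3": 3, "e4": 3, "e5": 3}   # candidate LaaJ B (uninformative)
print(sparse_align_score(human, judge_a))   # higher -> better aligned
print(sparse_align_score(human, judge_b))   # 0.0: never agrees on any ranking
```

On the toy data, a judge that assigns the same score everywhere collapses to zero because tied pairs contribute no ranking signal, which is the failure mode that pairwise confidence is meant to expose even with very few labeled examples.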
Related papers
- The Validity of Coreference-based Evaluations of Natural Language Understanding [3.505146496638911]
I analyze standard coreference evaluations and show that their design often leads to non-generalizable conclusions. I propose and implement a novel evaluation focused on testing systems' ability to infer the relative plausibility of events.
arXiv Detail & Related papers (2026-02-18T05:49:28Z) - LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation [40.06592175227558]
This paper investigates a principled approach to evaluating Retrieval-Augmented Generation systems in legal contexts. We find that traditional agreement metrics like Krippendorff's alpha can be misleading in the skewed distributions typical of AI system evaluations. Our findings suggest a path toward scalable, cost-effective evaluation that maintains the precision demanded by legal applications.
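As a self-contained illustration (not the paper's data) of why chance-corrected agreement misleads under skew: when 90% of items are rated "pass" by both raters, raw agreement is 0.90 while Krippendorff's alpha for nominal data comes out slightly negative.

```python
# Illustration (not from the paper): with heavily skewed labels, raw agreement
# is high while Krippendorff's alpha is near zero or negative.
from collections import Counter

def krippendorff_alpha_nominal(rater_a, rater_b):
    """Krippendorff's alpha for two raters, nominal data, no missing values."""
    pairs = list(zip(rater_a, rater_b))
    # Coincidence counts: each unit contributes both (a, b) and (b, a).
    coincidences = Counter()
    for a, b in pairs:
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    n_c = Counter()
    for (a, _b), count in coincidences.items():
        n_c[a] += count
    n = sum(n_c.values())                      # total pairable values (2 * units)
    d_obs = sum(c for (a, b), c in coincidences.items() if a != b)
    d_exp = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n - 1)
    return 1.0 - d_obs / d_exp

# 90 items where both raters say "pass", 10 items where they disagree.
rater_a = ["pass"] * 90 + ["pass"] * 5 + ["fail"] * 5
rater_b = ["pass"] * 90 + ["fail"] * 5 + ["pass"] * 5
raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(raw_agreement)                                  # 0.90
print(krippendorff_alpha_nominal(rater_a, rater_b))   # ~ -0.05 despite high raw agreement
```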
arXiv Detail & Related papers (2025-09-15T19:20:21Z) - Neither Valid nor Reliable? Investigating the Use of LLMs as Judges [23.16086453334644]
Large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators.
arXiv Detail & Related papers (2025-08-25T14:43:10Z) - SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection. Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains. We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
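The abstract does not describe how the semantic F1-score is computed, so the sketch below is only a guess at the general shape: an equivalence check (here a hard-coded synonym table standing in for the LLM agent) decides whether a predicted event type counts as a match before precision and recall are taken. All names and the synonym table are illustrative.

```python
# Hedged sketch of a semantic F1-score: an LLM agent (stubbed with a synonym
# table) decides whether a predicted event type semantically matches a gold type.
def is_semantic_match(predicted: str, gold: str) -> bool:
    # Placeholder for an LLM call; the synonym table is purely illustrative.
    synonyms = {"acquisition": {"merger-acquisition", "takeover"},
                "protest": {"demonstration"}}
    return predicted == gold or predicted in synonyms.get(gold, set())

def semantic_f1(predicted_types, gold_types):
    tp = sum(any(is_semantic_match(p, g) for g in gold_types) for p in predicted_types)
    precision = tp / len(predicted_types) if predicted_types else 0.0
    matched_gold = sum(any(is_semantic_match(p, g) for p in predicted_types) for g in gold_types)
    recall = matched_gold / len(gold_types) if gold_types else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(semantic_f1(["takeover", "protest"], ["acquisition", "protest"]))  # 1.0 with semantic matching
```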
arXiv Detail & Related papers (2025-03-05T09:37:05Z) - Bayesian Active Learning for Multi-Criteria Comparative Judgement in Educational Assessment [2.443343861973814]
Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria. This method leverages the human ability to make nuanced comparisons, yielding more reliable and valid assessments. However, rubrics remain widely used in education, offering structured criteria for grading and detailed feedback. This creates a gap between CJ's holistic ranking and the need for criterion-based performance breakdowns.
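For readers unfamiliar with CJ, a common way to turn pairwise judgements into a holistic ranking is a Bradley-Terry fit; the sketch below shows that step only and is not the paper's Bayesian active-learning procedure. The essay names and comparisons are made up.

```python
# Minimal Bradley-Terry fit over pairwise comparative judgements: turns
# holistic "A beats B" decisions into a ranking (illustration only).
from collections import defaultdict

def bradley_terry(comparisons, iterations=100):
    """comparisons: list of (winner, loser) pairs. Returns item -> strength."""
    items = {x for pair in comparisons for x in pair}
    strength = {x: 1.0 for x in items}
    wins = defaultdict(int)
    for winner, _loser in comparisons:
        wins[winner] += 1
    for _ in range(iterations):
        new = {}
        for i in items:
            denom = 0.0
            for winner, loser in comparisons:
                if i in (winner, loser):
                    other = loser if i == winner else winner
                    denom += 1.0 / (strength[i] + strength[other])
            new[i] = wins[i] / denom if denom else strength[i]
        total = sum(new.values())
        strength = {i: v * len(items) / total for i, v in new.items()}  # keep scale fixed
    return strength

judgements = [("essay_a", "essay_b"), ("essay_a", "essay_c"),
              ("essay_b", "essay_c"), ("essay_a", "essay_b")]
ranking = sorted(bradley_terry(judgements).items(), key=lambda kv: -kv[1])
print(ranking)   # essay_a ranked above essay_b above essay_c
```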
arXiv Detail & Related papers (2025-03-01T13:12:41Z) - Re-evaluating Open-ended Evaluation of Large Language Models [50.23008729038318]
We show that current Elo-based rating systems can be susceptible to, and even reinforce, biases in the data, whether intentional or accidental. We propose evaluation as a 3-player game and introduce novel game-theoretic solution concepts to ensure robustness to redundancy.
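A small sketch of the standard Elo update makes the redundancy point concrete: replaying wins against the same weak opponent keeps inflating a rating even though no new information is added. The K-factor and starting ratings are illustrative defaults, not values from the paper.

```python
# Standard Elo update (K = 32); redundant easy matchups inflate a rating,
# the kind of data bias a game-theoretic evaluation is meant to resist.
def elo_update(r_a, r_b, score_a, k=32):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a), r_b + k * ((1 - score_a) - (1 - expected_a))

model, weak_opponent = 1000.0, 1000.0
for _ in range(50):                      # 50 redundant wins against the same weak opponent
    model, weak_opponent = elo_update(model, weak_opponent, score_a=1.0)
print(round(model))                      # rating climbs on redundancy alone
```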
arXiv Detail & Related papers (2025-02-27T15:07:47Z) - CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists [12.542045913426639]
CheckEval is a checklist-based evaluation framework that improves rating reliability via binary questions. CheckEval dramatically improves the average agreement across evaluator models by 0.45 and reduces the score variance.
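A minimal sketch of the checklist idea, with hand-written questions and stubbed answers in place of LLM judgements; the questions and scores are illustrative, not CheckEval's actual checklist.

```python
# Sketch of a checklist-based score: a criterion is decomposed into binary
# questions and the score is the fraction answered "yes".
CHECKLIST = [
    "Does the summary mention the main event?",
    "Is every statement supported by the source?",
    "Is the summary free of repetition?",
    "Is the summary a single coherent paragraph?",
]

def checklist_score(binary_answers):
    """binary_answers: list of booleans, one per checklist question."""
    return sum(binary_answers) / len(binary_answers)

# Two evaluator models answering the same checklist for one output.
evaluator_1 = [True, True, False, True]
evaluator_2 = [True, True, True, True]
print(checklist_score(evaluator_1), checklist_score(evaluator_2))  # 0.75 1.0
```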
arXiv Detail & Related papers (2024-03-27T17:20:39Z) - HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preference.
HD-Eval inherits the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - Exploring validation metrics for offline model-based optimisation with diffusion models [50.404829846182764]
In model-based optimisation (MBO), we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black-box function called the (ground truth) oracle.
While an approximation to the ground-truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, this evaluation is approximate and vulnerable to adversarial examples.
This is encapsulated under our proposed evaluation framework, which is also designed to measure extrapolation.
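A toy example of the setup described above, with a made-up quadratic oracle: a surrogate fitted on a narrow offline slice scores candidates well in-distribution but drifts badly once candidates extrapolate beyond the training range.

```python
# Toy illustration (numpy only) of scoring candidates with a trained surrogate
# of the ground-truth oracle; the quadratic "oracle" and candidates are made up.
import numpy as np

rng = np.random.default_rng(0)
oracle = lambda x: -(x - 2.0) ** 2 + 4.0          # stand-in ground-truth reward

# Offline data only covers x in [-1, 1]; fit a linear surrogate on it.
x_train = rng.uniform(-1.0, 1.0, size=50)
y_train = oracle(x_train)
coeffs = np.polyfit(x_train, y_train, deg=1)
surrogate = np.poly1d(coeffs)

candidates = np.array([0.5, 1.5, 3.0])            # 3.0 lies outside the training range
print(surrogate(candidates))                      # surrogate's mean-reward estimates
print(oracle(candidates))                         # true rewards; the gap grows off-distribution
```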
arXiv Detail & Related papers (2022-11-19T16:57:37Z)