Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment
- URL: http://arxiv.org/abs/2602.05110v1
- Date: Wed, 04 Feb 2026 22:55:16 GMT
- Title: Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment
- Authors: Liang Wang, Junpeng Wang, Chin-chia Michael Yeh, Yan Zheng, Jiarui Sun, Xiran Fan, Xin Dai, Yujie Fan, Yiwei Cai,
- Abstract summary: Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment.
- Score: 26.786161923794115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment, combining a five-criterion rubric with Monte-Carlo scoring to evaluate rationale quality and evaluator stability. Five frontier LLMs generate and cross-evaluate MCC risk rationales under attributed and anonymized conditions. To establish a judge-independent reference, we introduce a consensus-deviation metric that eliminates circularity by comparing each judge's score to the mean of all other judges, yielding a theoretically grounded measure of self-evaluation and cross-model deviation. Results reveal substantial heterogeneity: GPT-5.1 and Claude 4.5 Sonnet show negative self-evaluation bias (-0.33, -0.31), while Gemini-2.5 Pro and Grok 4 display positive bias (+0.77, +0.71), with bias attenuating by 25.8 percent under anonymization. Evaluation by 26 payment-industry experts shows LLM judges assign scores averaging +0.46 points above human consensus, and that the negative bias of GPT-5.1 and Claude 4.5 Sonnet reflects closer alignment with human judgment. Ground-truth validation using payment-network data shows four models exhibit statistically significant alignment (Spearman rho = 0.56 to 0.77), confirming that the framework captures genuine quality. Overall, the framework provides a replicable basis for evaluating LLM-as-a-judge systems in payment-risk workflows and highlights the need for bias-aware protocols in operational financial settings.
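The consensus-deviation metric described in the abstract is simple enough to sketch. The snippet below is a minimal illustration, not the authors' implementation: the score matrix is fabricated, and the convention that the diagonal holds each judge's score on its own model's rationale (and the placeholder name for the fifth model, which the abstract does not identify) are assumptions.

```python
import numpy as np

# Hypothetical score matrix: rows = judges, columns = rationales. scores[k, i]
# is judge k's rubric score for rationale i; by assumption, rationale k was
# generated by judge k's own underlying model. All values here are made up.
judges = ["GPT-5.1", "Claude 4.5 Sonnet", "Gemini-2.5 Pro", "Grok 4", "Model-5"]
scores = np.random.default_rng(0).uniform(2.0, 5.0, size=(5, 5))

def consensus_deviation(scores: np.ndarray) -> np.ndarray:
    """For each (judge k, item i): judge k's score minus the mean score of all
    *other* judges on item i. Using the leave-one-out mean as the reference
    removes the circularity of including the judge in its own consensus."""
    n_judges = scores.shape[0]
    totals = scores.sum(axis=0, keepdims=True)        # per-item sums over judges
    others_mean = (totals - scores) / (n_judges - 1)  # leave-one-out consensus
    return scores - others_mean

dev = consensus_deviation(scores)

# Self-evaluation bias under the assumed layout: the diagonal, where each
# judge scores the rationale produced by its own model.
for k, name in enumerate(judges):
    print(f"{name}: self-evaluation deviation = {dev[k, k]:+.2f}")
```

Under this layout, a negative diagonal entry corresponds to the self-deprecating pattern the abstract reports for GPT-5.1 and Claude 4.5 Sonnet, and a positive one to the inflation reported for Gemini-2.5 Pro and Grok 4.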
Related papers
- Who can we trust? LLM-as-a-jury for Comparative Assessment [42.32900791516691]
Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment. LLM judges vary substantially in performance across tasks and aspects, and their judgment probabilities may be biased and inconsistent. We propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone (a model-form sketch appears after this list).
arXiv Detail & Related papers (2026-02-18T17:04:02Z)
- Uncertainty and Fairness Awareness in LLM-Based Recommendation Systems [3.937681476010311]
This paper studies how uncertainty and fairness evaluations affect the accuracy, consistency, and trustworthiness of large language models (LLMs). We quantify predictive uncertainty (via entropy) and demonstrate that Google DeepMind's Gemini 1.5 Flash exhibits systematic unfairness for certain sensitive attributes. We propose a novel uncertainty-aware evaluation methodology for RecLLMs, present empirical insights from deep uncertainty case studies, and introduce a personality profile-informed fairness benchmark.
arXiv Detail & Related papers (2026-01-31T17:18:13Z)
- No-Human in the Loop: Agentic Evaluation at Scale for Recommendation [11.764010898952677]
Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting.
arXiv Detail & Related papers (2025-11-04T22:49:39Z)
- Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems [32.83708359216193]
Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems. This paper systematically investigates judgment biases in two LLM-as-a-judge models under the point-wise scoring setting. We propose four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.
arXiv Detail & Related papers (2025-10-14T12:52:29Z)
- TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z)
- Quantitative LLM Judges [60.773734899532336]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
- Is Your Model Fairly Certain? Uncertainty-Aware Fairness Evaluation for LLMs [7.197702136906138]
We propose an uncertainty-aware fairness metric, UCerF, to enable a fine-grained evaluation of model fairness. Observing data size, diversity, and clarity issues in current datasets, we introduce a new gender-occupation fairness evaluation dataset. We establish a benchmark, using our metric and dataset, and apply it to evaluate the behavior of ten open-source AI systems.
arXiv Detail & Related papers (2025-05-29T20:45:18Z)
- J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization [69.23273504123941]
We train judges to be robust to positional biases that arise in more complex evaluation settings. We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. We present Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, respectively.
arXiv Detail & Related papers (2025-05-19T16:50:35Z)
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge [90.8674158031845]
We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed chain-of-thought (CoT) judgment. Our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling.
arXiv Detail & Related papers (2025-02-18T03:31:06Z)
- Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models [92.66784679667441]
Prometheus 2 is a more powerful evaluator LM that closely mirrors human and GPT-4 judgements. It is capable of processing both direct assessment and pairwise ranking formats grouped with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges.
arXiv Detail & Related papers (2024-05-02T17:59:35Z)
- Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias. Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested. It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z)
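As referenced in the BT-sigma entry above, the judge-aware Bradley-Terry idea can be sketched compactly. The paper summary does not give the exact parameterization, so the model form below (a per-judge discriminator scaling the standard Bradley-Terry log-odds) and the plain gradient-ascent fit are assumptions for illustration only, trained on synthetic comparison data.

```python
import numpy as np

# Synthetic pairwise comparisons: (judge_id, winner_item, loser_item).
rng = np.random.default_rng(1)
n_items, n_judges = 6, 3
true_theta = rng.normal(size=n_items)
comparisons = []
for _ in range(2000):
    k = rng.integers(n_judges)
    i, j = rng.choice(n_items, size=2, replace=False)
    p = 1 / (1 + np.exp(-(true_theta[i] - true_theta[j])))
    comparisons.append((k, i, j) if rng.random() < p else (k, j, i))

# Assumed model form: P(judge k prefers i over j) = sigmoid(sigma_k * (theta_i - theta_j)),
# where sigma_k is a per-judge discriminator (reliability) parameter.
theta = np.zeros(n_items)
sigma = np.ones(n_judges)
lr = 0.01
for _ in range(200):
    g_theta = np.zeros(n_items)
    g_sigma = np.zeros(n_judges)
    for k, w, l in comparisons:
        p = 1 / (1 + np.exp(-sigma[k] * (theta[w] - theta[l])))
        g = 1 - p                        # d(log-likelihood)/d(logit) for this comparison
        g_theta[w] += g * sigma[k]
        g_theta[l] -= g * sigma[k]
        g_sigma[k] += g * (theta[w] - theta[l])
    theta += lr * g_theta
    sigma += lr * g_sigma
    theta -= theta.mean()                # pin down the translation invariance of theta
    # Note: a joint scale invariance between sigma and theta also exists; a full
    # implementation would fix it as well (e.g., by normalizing the mean of sigma).

print("inferred item scores:", np.round(theta, 2))
print("inferred judge discriminators:", np.round(sigma, 2))
```

A small inferred sigma_k flattens judge k's preference probabilities toward chance, which is how this kind of model downweights unreliable judges when ranking items.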
This list is automatically generated from the titles and abstracts of the papers on this site.