No-Human in the Loop: Agentic Evaluation at Scale for Recommendation
- URL: http://arxiv.org/abs/2511.03051v1
- Date: Tue, 04 Nov 2025 22:49:39 GMT
- Title: No-Human in the Loop: Agentic Evaluation at Scale for Recommendation
- Authors: Tao Zhang, Kehui Yao, Luyi Ma, Jiao Chen, Reza Yousefi Maragheh, Kai Zhao, Jianpeng Xu, Evren Korpeoglu, Sushant Kumar, Kannan Achan
- Abstract summary: Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting.
- Score: 11.764010898952677
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Evaluating large language models (LLMs) as judges is increasingly critical for building scalable and trustworthy evaluation pipelines. We present ScalingEval, a large-scale benchmarking study that systematically compares 36 LLMs, including GPT, Gemini, Claude, and Llama, across multiple product categories using a consensus-driven evaluation protocol. Our multi-agent framework aggregates pattern audits and issue codes into ground-truth labels via scalable majority voting, enabling reproducible comparison of LLM evaluators without human annotation. Applied to large-scale complementary-item recommendation, the benchmark reports four key findings: (i) Anthropic Claude 3.5 Sonnet achieves the highest decision confidence; (ii) Gemini 1.5 Pro offers the best overall performance across categories; (iii) GPT-4o provides the most favorable latency-accuracy-cost tradeoff; and (iv) GPT-OSS 20B leads among open-source models. Category-level analysis shows strong consensus in structured domains (Electronics, Sports) but persistent disagreement in lifestyle categories (Clothing, Food). These results establish ScalingEval as a reproducible benchmark and evaluation protocol for LLMs as judges, with actionable guidance on scaling, reliability, and model family tradeoffs.
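The consensus protocol described in the abstract reduces to a simple aggregation step: each LLM judge emits a verdict or issue code for a candidate recommendation, and a label is promoted to ground truth only when a majority of judges agree. Below is a minimal sketch of that step in Python; the judge names, issue codes, and strict-majority threshold are illustrative assumptions, since the exact voting and tie-breaking rules are not given here.

```python
# Minimal sketch of consensus label aggregation in the spirit of ScalingEval's
# protocol. Judge names, issue codes, and the agreement threshold below are
# illustrative assumptions, not the paper's exact implementation.
from collections import Counter
from typing import Dict, Optional

def majority_vote(judge_labels: Dict[str, str], min_agreement: float = 0.5) -> Optional[str]:
    """Aggregate per-judge labels for one item pair into a consensus label.

    judge_labels maps an LLM judge name (e.g. "gpt-4o") to its verdict
    (e.g. "complementary", "not_complementary"). Returns the majority label,
    or None when no label clears the agreement threshold.
    """
    if not judge_labels:
        return None
    counts = Counter(judge_labels.values())
    label, votes = counts.most_common(1)[0]
    # Require a strict majority of judges before emitting a "ground-truth"
    # label; otherwise leave the item pair unresolved.
    if votes / len(judge_labels) > min_agreement:
        return label
    return None

# Example: three judges agree, one dissents -> consensus is "complementary".
verdicts = {
    "claude-3.5-sonnet": "complementary",
    "gemini-1.5-pro": "complementary",
    "gpt-4o": "complementary",
    "llama-3-70b": "not_complementary",
}
print(majority_vote(verdicts))  # complementary
```

Under a scheme like this, the persistent disagreement the paper reports for lifestyle categories (Clothing, Food) would surface as a higher fraction of item pairs left unresolved by the vote.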
Related papers
- Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment [26.786161923794115]
Large Language Models (LLMs) are increasingly used as evaluators of reasoning quality, yet their reliability and bias in payments-risk settings remain poorly understood. We introduce a structured multi-evaluator framework for assessing LLM reasoning in Merchant Category Code (MCC)-based merchant risk assessment.
arXiv Detail & Related papers (2026-02-04T22:55:16Z) - Do LLM-judges Align with Human Relevance in Cranfield-style Recommender Evaluation? [40.49875426230813]
This paper investigates whether Large Language Models (LLMs) can serve as reliable automatic judges to address scalability challenges. Using the ML-32M-ext Cranfield-style movie recommendation collection, we first examine the limitations of existing evaluation methodologies. We find that incorporating richer item metadata and longer user histories improves alignment, and that the LLM judge yields high agreement with human-based rankings.
arXiv Detail & Related papers (2025-11-28T16:10:39Z) - LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol for evaluation can significantly affect evaluation reliability and induce systematic biases. We find that generator models can flip preferences by embedding distractor features. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks [0.11458853556386796]
This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs). We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Results show o4-mini and Claude 3.5 Sonnet achieve the best overall performance; OpenAI's GPT-4.1, GPT-4o, and Google's Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in...
arXiv Detail & Related papers (2025-03-23T16:20:14Z) - Language Model Preference Evaluation with Multiple Weak Evaluators [89.90733463933431]
We introduce PGED, a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs for acyclic, non-contradictory evaluation results. We demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. (A simplified sketch of this graph-denoising idea appears after this list.)
arXiv Detail & Related papers (2024-10-14T01:57:25Z) - Direct Judgement Preference Optimization [79.54459973726405]
We train large language models (LLMs) as generative judges to evaluate and critique other models' outputs. We employ three approaches to collect the preference pairs for different use cases, each aimed at improving our generative judge from a different perspective. Our model robustly counters inherent biases such as position and length bias, flexibly adapts to any evaluation protocol specified by practitioners, and provides helpful language feedback for improving downstream generator models.
arXiv Detail & Related papers (2024-09-23T02:08:20Z) - Toward Automatic Relevance Judgment using Vision-Language Models for Image-Text Retrieval Evaluation [56.49084589053732]
Vision-Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain.
This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V, within a large-scale ad hoc retrieval task tailored for multimedia content creation in a zero-shot fashion.
arXiv Detail & Related papers (2024-08-02T16:15:25Z) - Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias. Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested. It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z)
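Several of the related entries above, most directly PGED, treat multi-evaluator aggregation as a graph problem: pairwise preferences from weak evaluators are merged into one weighted graph, and contradictory edges are pruned until the result is acyclic. The sketch below illustrates that idea under strong simplifying assumptions (only mutual 2-cycle contradictions are removed, and edge weights are raw vote counts), so it should be read as an illustration rather than the PGED algorithm.

```python
# Rough sketch of preference-graph ensembling and denoising, loosely inspired
# by the PGED entry above. Data structures, weighting, and the cycle-breaking
# heuristic are assumptions for illustration, not the paper's algorithm.
from collections import defaultdict
from typing import Dict, List, Tuple

def build_graph(evaluator_prefs: List[List[Tuple[str, str]]]) -> Dict[Tuple[str, str], int]:
    """Sum pairwise preferences (winner, loser) across weak evaluators."""
    weights: Dict[Tuple[str, str], int] = defaultdict(int)
    for prefs in evaluator_prefs:
        for winner, loser in prefs:
            weights[(winner, loser)] += 1
    return dict(weights)

def denoise_two_cycles(weights: Dict[Tuple[str, str], int]) -> Dict[Tuple[str, str], int]:
    """Drop the weaker direction of every mutual (a -> b, b -> a) contradiction."""
    edges = dict(weights)
    for (a, b) in list(edges):
        if (a, b) in edges and (b, a) in edges:
            drop = (a, b) if edges[(a, b)] < edges[(b, a)] else (b, a)
            edges.pop(drop, None)
    return edges

# Three evaluators compare responses A, B, C; one contradicts the others on (A, B).
prefs = [
    [("A", "B"), ("B", "C")],
    [("A", "B"), ("A", "C")],
    [("B", "A"), ("B", "C")],
]
print(denoise_two_cycles(build_graph(prefs)))
# {('A', 'B'): 2, ('B', 'C'): 2, ('A', 'C'): 1}
```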