Related papers: A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

URL: http://arxiv.org/abs/2507.08491v1
Date: Fri, 11 Jul 2025 11:16:01 GMT
Title: A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
Authors: David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler,
Abstract summary: We present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use.<n>We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English)
Score: 18.149327897427234
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue game based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one's own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.

Related papers

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following.<n>For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses.<n>Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z)
Aligning Language Model Benchmarks with Pairwise Preferences [15.427340427081843]
We introduce benchmark alignment, where we use limited amounts of information about model performance to automatically update offline benchmarks.<n>We then propose BenchAlign, which learns preference-aligned weight-ings for benchmark questions.<n>Our experiments show that our aligned benchmarks can accurately rank unseen models according to models of human preferences, even across different sizes.
arXiv Detail & Related papers (2026-02-02T23:11:09Z)
Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings [23.9553588103042]
We propose a item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves.<n>We show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity.<n>We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation.
arXiv Detail & Related papers (2025-10-30T11:28:58Z)
Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations [70.94563079082751]
E-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions.<n>We propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval.<n>Our approach builds on a generative retriever, further augmented with a test-time reranking mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue.
arXiv Detail & Related papers (2025-08-25T15:38:56Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
RewardBench 2: Advancing Reward Model Evaluation [71.65938693914153]
Reward models are used throughout the post-training of language models to capture nuanced signals from preference data.<n>The community has begun establishing best practices for evaluating reward models.<n>This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark.
arXiv Detail & Related papers (2025-06-02T17:54:04Z)
From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce bftextFeedbacker, an evaluation framework that provides comprehensive and fine-grained results.<n>Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z)
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models [0.29687381456164]
VARCO Arena is a novel, cost-effective, and robust benchmarking approach for large language models.<n>Our results demonstrate that VARCO Arena not only produces reliable LLM rankings but also provides a scalable, adaptable solution for qualitative evaluation.
arXiv Detail & Related papers (2024-11-02T15:23:28Z)
TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction [29.72874725703848]
Large language models (LLMs) are increasingly deployed to various vertical domains.<n>Current evaluation methods rely on static and resource-intensive datasets that are not aligned with real-world requirements.<n>We introduce two key concepts: textbfBenchmark+, which extends the traditional question-answer benchmark into a more flexible strategy-criterion'' format.<n>We propose textbftextscTestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles [2.8839090723566296]
TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform. TurtleBench includes 1,532 user guesses along with the correctness of guesses after annotation. We thoroughly evaluated nine of the most advanced Large Language Models available today.
arXiv Detail & Related papers (2024-10-07T17:58:47Z)
HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly [34.205934899868346]
We introduce HELMET, a comprehensive benchmark encompassing seven diverse, application-centric categories.<n>We find that synthetic tasks like NIAH do not reliably predict downstream performance.<n>While most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when tasks require full-context reasoning.
arXiv Detail & Related papers (2024-10-03T17:20:11Z)
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.<n>A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs) We utilize a multi-agent system to manipulate the context or question of original instances, reframing new evolving instances with high confidence. Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
Evaluating the Evaluators: Are Current Few-Shot Learning Benchmarks Fit for Purpose? [11.451691772914055]
This paper presents the first investigation into task-level evaluation. We measure the accuracy of performance estimators in the few-shot setting. We examine the reasons for the failure of evaluators usually thought of as being robust.
arXiv Detail & Related papers (2023-07-06T02:31:38Z)
UMSE: Unified Multi-scenario Summarization Evaluation [52.60867881867428]
Summarization quality evaluation is a non-trivial task in text summarization. We propose Unified Multi-scenario Summarization Evaluation Model (UMSE) Our UMSE is the first unified summarization evaluation framework engaged with the ability to be used in three evaluation scenarios.
arXiv Detail & Related papers (2023-05-26T12:54:44Z)
Frustratingly Simple Few-Shot Object Detection [98.42824677627581]
We find that fine-tuning only the last layer of existing detectors on rare classes is crucial to the few-shot object detection task. Such a simple approach outperforms the meta-learning methods by roughly 220 points on current benchmarks.
arXiv Detail & Related papers (2020-03-16T00:29:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.