Related papers: SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges

URL: http://arxiv.org/abs/2508.06111v1
Date: Fri, 08 Aug 2025 08:16:40 GMT
Title: SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges
Authors: Dewi S. W. Gould, Bruno Mlodozeniec, Samuel F. Brown,
Abstract summary: We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating verifiable tasks for one another.<n>Our core is to treat evaluation as a game: models as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses.<n>Using a TrueSkill-based ranking system, we evaluate six LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine
Score: 2.184775414778289
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating the capabilities and risks of foundation models is paramount, yet current methods demand extensive domain expertise, hindering their scalability as these models rapidly evolve. We introduce SKATE: a novel evaluation framework in which large language models (LLMs) compete by generating and solving verifiable tasks for one another. Our core insight is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions which highlight their own strengths while exposing others' weaknesses. SKATE offers several key advantages, balancing scalability, open-endedness, and objectivity. It is fully automated, data-free, and scalable, requiring no human input or domain expertise. By using verifiable tasks rather than LLM judges, scoring is objective. Unlike domain-limited programmatically-generated benchmarks (e.g. chess-playing or spatial reasoning), having LLMs creatively pose challenges enables open-ended and scalable evaluation. As a proof of concept, we introduce LLM-set code-output-prediction (COP) challenges as a verifiable and extensible framework in which to test our approach. Using a TrueSkill-based ranking system, we evaluate six frontier LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine-grained capability differences between models. Our findings are an important step towards general, scalable evaluation frameworks which can keep pace with LLM progress.

Related papers

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty [102.02839046225468]
RankLLM is a novel framework designed to quantify both question difficulty and model competency.<n>We evaluate 30 models on 35,550 questions across multiple domains.
arXiv Detail & Related papers (2026-02-12T21:28:46Z)
BotzoneBench: Scalable LLM Evaluation via Graded AI Anchors [9.224594551677374]
Large Language Models (LLMs) are increasingly deployed in interactive environments requiring strategic decision-making.<n>Recent game-based evaluations employ LLM-vs-LLM tournaments that produce relative rankings dependent on transient model pools.<n>Here we show that anchoring LLM evaluation to fixed hierarchies of skill-calibrated game Artificial Intelligence (AI) enables linear-time absolute skill measurement.
arXiv Detail & Related papers (2026-01-22T13:15:08Z)
LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models [13.713870642186254]
Large language models (LLMs) demonstrate remarkable capabilities across various tasks.<n>Existing evaluation methods suffer from issues such as data contamination, black-box operation, and subjective preference.<n>We propose a novel benchmark-free evaluation paradigm, LLM-Crowdsourced.
arXiv Detail & Related papers (2025-07-30T03:50:46Z)
Teach2Eval: An Indirect Evaluation Method for LLM by Judging How It Teaches [46.0474342507327]
We introduce Teach2Eval, an indirect evaluation framework inspired by the Feynman Technique.<n>Our method evaluates a model's multiple abilities to teach weaker student models to perform tasks effectively.
arXiv Detail & Related papers (2025-05-18T06:51:10Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.<n>However, they still struggle with problems requiring multi-step decision-making and environmental feedback.<n>We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.<n>LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.<n>Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens [30.083110119139793]
We introduce textbfAbilityLens, a unified benchmark designed to evaluate MLLMs in six key perception abilities.<n>We identify the strengths and weaknesses of current main-stream MLLMs, highlighting stability patterns and revealing a notable performance gap between state-of-the-art open-source and closed-source models.
arXiv Detail & Related papers (2024-11-22T04:41:20Z)
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [65.92331309449015]
We introduce AutoBench-V, an automated framework for serving evaluation on demand, i.e., benchmarking LVLMs based on specific aspects of model capability.<n>Through an extensive evaluation of nine popular LVLMs across five demanded user inputs, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [74.22729793816451]
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability. We propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization. We evaluate CREATOR on MATH and TabMWP benchmarks, respectively consisting of challenging math competition problems.
arXiv Detail & Related papers (2023-05-23T17:51:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.