RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
- URL: http://arxiv.org/abs/2602.12424v1
- Date: Thu, 12 Feb 2026 21:28:46 GMT
- Title: RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
- Authors: Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun
- Abstract summary: RankLLM is a novel framework designed to quantify both question difficulty and model competency. We evaluate 30 models on 35,550 questions across multiple domains.
- Score: 102.02839046225468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benchmarks establish a standardized evaluation framework for systematically assessing the performance of large language models (LLMs), facilitating objective comparisons and driving advances in the field. However, existing benchmarks fail to differentiate question difficulty, which limits their ability to distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework that quantifies both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. Its core mechanism is bidirectional score propagation between models and questions, built on the intuition that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
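The bidirectional propagation described above can be pictured as a fixed-point iteration over a model-by-question response matrix. The sketch below is a minimal, illustrative reading of that loop, assuming binary correctness labels; the paper's exact update rules, weighting, and normalization are not given in the abstract, so these are assumptions.

```python
import numpy as np

def rank_llm_sketch(R, iters=100, tol=1e-8):
    """Illustrative mutual-reinforcement scoring (not the paper's exact rules).

    R: (n_models, n_questions) binary matrix, R[m, q] = 1 if model m
    answered question q correctly. Returns (competency, difficulty).
    """
    n_models, n_questions = R.shape
    competency = np.full(n_models, 1.0 / n_models)
    difficulty = np.full(n_questions, 1.0 / n_questions)
    for _ in range(iters):
        # A question's difficulty grows with the competency of the
        # models it defeats ...
        new_difficulty = (1 - R).T @ competency
        # ... and a model's competency grows with the difficulty of
        # the questions it solves.
        new_competency = R @ difficulty
        new_difficulty /= new_difficulty.sum() or 1.0  # keep scales comparable
        new_competency /= new_competency.sum() or 1.0
        done = (np.abs(new_competency - competency).max() < tol and
                np.abs(new_difficulty - difficulty).max() < tol)
        competency, difficulty = new_competency, new_difficulty
        if done:
            break
    return competency, difficulty
```

Sorting models by the returned competency vector yields a leaderboard; the abstract's fast-convergence claim would correspond to this kind of iteration reaching a fixed point in few passes.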
Related papers
- Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus [11.916129241436584]
Large Language Models (LLMs) are increasingly deployed to generate educational content. We benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. Our results reveal a significant trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade significantly on multimodal items.
arXiv Detail & Related papers (2026-02-06T11:44:18Z)
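The "moderate rank correlation" and systematic underestimation reported above are the kind of findings typically measured with Spearman's ρ against the official IRT b-parameters plus a signed-error check. A minimal sketch with made-up numbers (the paper's actual data and scales are not reproduced here):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical values: official IRT difficulty (b) vs. LLM estimates
# for the same items; the real ENEM data is not reproduced here.
irt_b   = np.array([-1.2, -0.4, 0.3, 0.9, 1.8, 2.5])
llm_est = np.array([-0.9, -0.5, 0.4, 0.1, 1.0, 1.1])

rho, p = spearmanr(irt_b, llm_est)       # rank agreement
bias = float(np.mean(llm_est - irt_b))   # negative => underestimation
print(f"Spearman rho={rho:.2f} (p={p:.3f}), mean signed error={bias:+.2f}")
```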
- JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation [13.831735556002426]
Small language models (SLMs) have shown promise on various reasoning tasks. Their ability to judge the correctness of answers, however, remains unclear compared to that of large language models (LLMs).
arXiv Detail & Related papers (2025-11-20T01:14:39Z)
- TFRank: Think-Free Reasoning Enables Practical Pointwise LLM Ranking [21.930228130429573]
Reasoning-intensive ranking models built on Large Language Models (LLMs) have made notable progress. However, existing approaches often rely on large-scale LLMs and explicit Chain-of-Thought (CoT) reasoning. We propose TFRank, an efficient pointwise reasoning ranker based on small-scale LLMs.
arXiv Detail & Related papers (2025-08-13T06:47:58Z)
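Pointwise ranking, as in TFRank, scores each query-passage pair independently and sorts. The sketch below uses a hypothetical `score_fn` as a stand-in for the small-LLM call (e.g., the probability assigned to a "relevant" token), since TFRank's prompting and training details are not given in this summary.

```python
def pointwise_rank(query, passages, score_fn):
    """Generic pointwise ranking: score each passage on its own, then
    sort descending. `score_fn(query, passage) -> float` is a
    hypothetical stand-in for a small-LLM relevance call."""
    return sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
```

Because each pair is scored independently, pointwise ranking parallelizes trivially and needs no explicit chain-of-thought output.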
- SKATE, a Scalable Tournament Eval: Weaker LLMs differentiate between stronger ones using verifiable challenges [2.184775414778289]
We introduce SKATE, a novel evaluation framework in which large language models (LLMs) compete by generating verifiable tasks for one another. Our core idea is to treat evaluation as a game: models act as both task-setters and solvers, incentivized to create questions that highlight their own strengths while exposing others' weaknesses. Using a TrueSkill-based ranking system, we evaluate six LLMs and find that: (1) weaker models can reliably differentiate and score stronger ones, (2) LLM-based systems are capable of self-preferencing behavior, generating questions that align with their own capabilities, and (3) SKATE automatically surfaces fine
arXiv Detail & Related papers (2025-08-08T08:16:40Z)
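The TrueSkill bookkeeping behind such a tournament is straightforward with the `trueskill` package; a sketch follows, with SKATE's actual task generation and verification (the hard part) left out and match outcomes assumed given.

```python
from trueskill import Rating, rate_1vs1  # pip install trueskill

# One rating per model; mu is the skill estimate, sigma its uncertainty.
ratings = {name: Rating() for name in ("model_a", "model_b", "model_c")}

def record_match(winner, loser):
    """Update both ratings after a verifiable challenge is decided."""
    ratings[winner], ratings[loser] = rate_1vs1(ratings[winner], ratings[loser])

# Suppose model_a solved model_b's task while model_b failed model_a's:
record_match("model_a", "model_b")
leaderboard = sorted(ratings, key=lambda n: ratings[n].mu, reverse=True)
```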
- Criteria-Based LLM Relevance Judgments [5.478764356647438]
Large Language Models (LLMs) provide a scalable solution by generating relevance labels directly through prompting. We propose the Multi-Criteria framework for LLM-based relevance judgments, decomposing the notion of relevance into multiple criteria. Our results demonstrate that Multi-Criteria judgments improve system ranking/leaderboard performance.
arXiv Detail & Related papers (2025-07-13T04:21:21Z)
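One natural reading of the Multi-Criteria framework is to prompt the LLM once per criterion and then aggregate. The criteria names and weights below are illustrative assumptions; the summary does not specify the paper's decomposition or aggregation rule.

```python
# Illustrative criteria and weights -- not the paper's actual choices.
WEIGHTS = {"topicality": 0.4, "coverage": 0.3, "trustworthiness": 0.3}

def aggregate_relevance(criterion_scores):
    """Combine per-criterion judgments (each in [0, 1], e.g. one LLM
    prompt per criterion) into a single graded relevance score."""
    return sum(w * criterion_scores[c] for c, w in WEIGHTS.items())

label = aggregate_relevance(
    {"topicality": 1.0, "coverage": 0.5, "trustworthiness": 0.8})  # 0.79
```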
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is generalization. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice. A popular attempt to lower the cost is to compute the average score on a subset of the benchmark. This approach often yields an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the chosen subset. We train a model that predicts question difficulty from its content, enabling reliable measurement at a fraction of the cost.
arXiv Detail & Related papers (2025-03-17T16:15:02Z)
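A minimal stand-in for "a model that predicts question difficulty from its content" is a text-regression pipeline; the features, targets, and model family below are assumptions for illustration, not the paper's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy data: difficulty as the fraction of models that miss a question.
questions = [
    "What is 2 + 2?",
    "Prove that sqrt(2) is irrational.",
    "Name the capital of France.",
    "Derive the gradient of the softmax cross-entropy loss.",
]
difficulty = [0.02, 0.65, 0.05, 0.80]

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(questions, difficulty)
print(model.predict(["Integrate x * exp(x) dx by parts."]))
```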
- LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning [56.273799410256075]
The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path. Tested on general and advanced benchmarks, it shows superior search efficiency and problem-solving capability.
arXiv Detail & Related papers (2024-10-03T18:12:29Z)
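The MCTS half of LLaMA-Berry hinges on a selection rule such as UCT. The sketch below shows standard UCB1-style child selection, not necessarily the paper's exact variant; the Self-Refine expansion and pairwise reward scoring the paper adds are omitted.

```python
import math

def uct_select(children, c=1.4):
    """Pick the child maximizing value/visits + c*sqrt(ln(N)/visits).
    `children` maps child id -> (total_value, visits)."""
    total = sum(v for _, v in children.values()) or 1
    def score(stats):
        value, visits = stats
        if visits == 0:
            return float("inf")   # try unvisited children first
        return value / visits + c * math.sqrt(math.log(total) / visits)
    return max(children, key=lambda k: score(children[k]))

best = uct_select({"step_a": (3.0, 5), "step_b": (1.0, 1), "step_c": (0.0, 0)})
```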
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited to system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures [57.886592207948844]
We propose MixEval, a new paradigm for establishing efficient, gold-standard evaluation by strategically mixing off-the-shelf benchmarks.
It bridges (1) comprehensive and well-distributed real-world user queries and (2) efficient and fairly graded ground-truth-based benchmarks by matching queries mined from the web with similar queries from existing benchmarks.
arXiv Detail & Related papers (2024-06-03T05:47:05Z)
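MixEval's query-matching step can be sketched as nearest-neighbor search in an embedding space. TF-IDF cosine similarity below is a deliberately simple stand-in, as the summary does not say which embedding or matcher the paper actually uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

web_queries = ["why does my program crash with a segmentation fault",
               "how to find eigenvalues of a matrix"]
benchmark_queries = ["What causes a segmentation fault in C programs?",
                     "Explain eigenvalues and eigenvectors.",
                     "What is the capital of Australia?"]

vec = TfidfVectorizer().fit(web_queries + benchmark_queries)
sims = cosine_similarity(vec.transform(web_queries),
                         vec.transform(benchmark_queries))
for wq, j in zip(web_queries, sims.argmax(axis=1)):  # best match per query
    print(wq, "->", benchmark_queries[j])
```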
This list is automatically generated from the titles and abstracts of the papers on this site.