Estimating problem difficulty without ground truth using Large Language Model comparisons
- URL: http://arxiv.org/abs/2512.14220v1
- Date: Tue, 16 Dec 2025 09:13:56 GMT
- Title: Estimating problem difficulty without ground truth using Large Language Model comparisons
- Authors: Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis
- Abstract summary: We propose a new method for estimating problem difficulty, LLM compare. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation.
- Score: 4.599673637363014
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e. problems currently unsolvable by humans and LLMs, because they are not scalable, are time-consuming, and depend on ground truth. Therefore, we propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes--construction, scale and dependence--identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic and independent of ground truth information. As a second validation, we show that LLM compare demonstrates strong alignment with human annotations: Pearson $r \geq 0.80$ for $n=1876$. Third, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation for $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
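The core computation described above, turning pairwise "which problem is harder?" judgments into scalar difficulty scores, can be sketched as a standard Bradley-Terry fit. The sketch below uses Hunter's MM update; the comparison counts are illustrative placeholders, not data from the paper, and the function name is our own.

```python
# Minimal sketch: Bradley-Terry difficulty scores from pairwise outcomes,
# fitted with the MM (minorization-maximization) update. In the paper's
# setting, wins[(i, j)] would count how often an LLM judged problem i
# harder than problem j; here the counts are made up for illustration.

def bradley_terry(n_items, wins, iters=500, tol=1e-10):
    """wins[(i, j)] = number of times item i was judged harder than item j."""
    p = [1.0] * n_items
    # Total comparisons per unordered pair.
    n = {}
    for (i, j), w in wins.items():
        key = (min(i, j), max(i, j))
        n[key] = n.get(key, 0) + w
    for _ in range(iters):
        new_p = []
        for i in range(n_items):
            w_i = sum(w for (a, _), w in wins.items() if a == i)  # total wins of i
            denom = 0.0
            for (a, b), m in n.items():
                if a == i:
                    denom += m / (p[i] + p[b])
                elif b == i:
                    denom += m / (p[i] + p[a])
            new_p.append(w_i / denom if denom > 0 else p[i])
        # Normalize: Bradley-Terry scores are identifiable only up to scale.
        s = sum(new_p)
        new_p = [x / s for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p

# Three problems; problem 2 wins most "harder than" comparisons.
wins = {(2, 0): 9, (0, 2): 1, (2, 1): 8, (1, 2): 2, (1, 0): 7, (0, 1): 3}
scores = bradley_terry(3, wins)
print(scores)  # scores[2] > scores[1] > scores[0]
```

The resulting scores are continuous, so they can rank problems whose ground-truth solutions are unknown, which is the property the abstract emphasizes.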
Related papers
- RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty [102.02839046225468]
RankLLM is a novel framework designed to quantify both question difficulty and model competency. We evaluate 30 models on 35,550 questions across multiple domains.
arXiv Detail & Related papers (2026-02-12T21:28:46Z) - ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking [44.58919028628059]
Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Supervise with Critical Thinking (ACT) data pipeline.
arXiv Detail & Related papers (2025-11-13T00:32:30Z) - LLMs Encode How Difficult Problems Are [4.990590622073335]
We investigate whether large language models encode problem difficulty in a way that aligns with human judgment. We train linear probes across layers and token positions on 60 models, evaluating on mathematical and coding subsets of Easy2HardBench.
arXiv Detail & Related papers (2025-10-20T22:48:23Z) - The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations [33.65540900920885]
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. We propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM.
arXiv Detail & Related papers (2025-09-16T09:38:41Z) - Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability. We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z) - Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs [45.83245433138508]
Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad spectrum of tasks. They apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute strategies, which aim to improve the computational efficiency of LLM reasoning.
arXiv Detail & Related papers (2025-07-02T18:27:42Z) - Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms [0.0]
Estimating item difficulty through field-testing is often resource-intensive and time-consuming. The present research examines the feasibility of using large language models (LLMs) to predict item difficulty for K-5 mathematics and reading assessment items.
arXiv Detail & Related papers (2025-04-09T00:04:07Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z) - Online Cascade Learning for Efficient Inference over Streams [9.516197133796437]
Large Language Models (LLMs) have a natural role in answering complex queries about data streams.
We propose online cascade learning, the first approach to address this challenge.
We formulate the task of learning cascades online as an imitation-learning problem.
arXiv Detail & Related papers (2024-02-07T01:46:50Z) - Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs).
We formulate a novel task Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables.
We show that these models achieve almost close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.