Related papers: Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus

Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus

URL: http://arxiv.org/abs/2602.06631v1
Date: Fri, 06 Feb 2026 11:44:18 GMT
Title: Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus
Authors: Thiago Brant, Julien Kühn, Jun Pang,
Abstract summary: Large Language Models (LLMs) are increasingly deployed to generate educational content.<n>We benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions.<n>Our results reveal a significant trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade significantly on multimodal items.
Score: 11.916129241436584
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil's high-stakes ENEM exam as a testbed, we benchmark ten proprietary and open-weight LLMs against official Item Response Theory (IRT) parameters for 1,031 questions. We evaluate performance along three axes: absolute calibration, rank fidelity, and context sensitivity across learner backgrounds. Our results reveal a significant trade-off: while the best models achieve moderate rank correlation, they systematically underestimate difficulty and degrade significantly on multimodal items. Crucially, we find that models exhibit limited and inconsistent plasticity when prompted with student demographic cues, suggesting they are not yet ready for context-adaptive personalization. We conclude that LLMs function best as calibrated screeners rather than authoritative oracles, supporting an "evaluation-before-generation" pipeline for responsible assessment design.

Related papers

Benchmarking at the Edge of Comprehension [38.43582342860192]
If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake.<n>We propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible.<n>Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims.
arXiv Detail & Related papers (2026-02-15T20:51:29Z)
RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty [102.02839046225468]
RankLLM is a novel framework designed to quantify both question difficulty and model competency.<n>We evaluate 30 models on 35,550 questions across multiple domains.
arXiv Detail & Related papers (2026-02-12T21:28:46Z)
Evaluating from Benign to Dynamic Adversarial: A Squid Game for Large Language Models [57.33350664910483]
We introduce Squid Game, a dynamic and adversarial evaluation environment with resource-constrained and asymmetric information settings.<n>We evaluate over 50 LLMs on Squid Game, presenting the largest behavioral evaluation study of general LLMs on dynamic adversarial scenarios.
arXiv Detail & Related papers (2025-11-12T06:06:29Z)
Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings.<n>Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z)
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.<n>We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.<n>LLEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games [3.725822359130832]
Large Language Models (LLMs) are increasingly being explored as evaluators in serious games.<n>This study investigates the reliability of five small-scale LLMs when assessing player responses in textitEn-join, a game that simulates decision-making within energy communities.<n>Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance.
arXiv Detail & Related papers (2025-04-13T10:46:13Z)
Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in Large Language Models [0.6091702876917281]
Large Language Models (LLMs) show remarkable proficiency in natural language tasks.<n>Overconfidence-misalignment between predicted confidence and true correctness poses significant risks in critical decision-making applications.<n>We present a comprehensive analysis on calibration in LLMs across nine LLMs and three factual Question-Answering datasets.
arXiv Detail & Related papers (2025-02-16T07:46:09Z)
Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs? [32.972545797220924]
Large Language Models (LLMs) are commonly evaluated using human-crafted benchmarks.<n> GAOKAO-Eval reveals that high scores still fail to truly reflect human-aligned capabilities.
arXiv Detail & Related papers (2024-12-13T11:38:10Z)
Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks.<n>We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z)
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability.<n>Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum.<n>We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.