Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
- URL: http://arxiv.org/abs/2601.09953v1
- Date: Thu, 15 Jan 2026 00:25:01 GMT
- Title: Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
- Authors: Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger
- Abstract summary: We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of math questions for real-world students. We simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively.
- Score: 36.23612429926861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computational cost and accuracy. We find that role-plays with named students improve predictions (compared to student IDs), and that stratifying names across gender and race improves them further. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
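The core pipeline in the abstract, simulating a classroom of role-played students, fitting an IRT model to their answers, and correlating the learned item difficulties with NAEP statistics, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the response matrix is synthetic toy data standing in for the LLM "classroom" outcomes, the IRT fit is a simple 1PL (Rasch) model estimated by gradient ascent, and the toy latent difficulties stand in for NAEP item-level statistics.

```python
# Hypothetical sketch: fit a 1PL (Rasch) IRT model to a simulated
# "classroom" response matrix and correlate the learned item difficulties
# with reference (real-world) difficulties. All data below is synthetic.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Toy stand-in for simulation output: rows = role-played students,
# columns = items; entry is 1 if the simulated student answered correctly.
n_students, n_items = 60, 20
true_theta = rng.normal(0, 1, n_students)   # latent student abilities
true_b = rng.normal(0, 1, n_items)          # latent item difficulties
p_correct = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = (rng.random((n_students, n_items)) < p_correct).astype(float)

def fit_rasch(R, n_iter=3000, lr=0.1):
    """Joint MLE for the Rasch model: P(correct) = sigmoid(theta_i - b_j)."""
    n_s, n_i = R.shape
    theta, b = np.zeros(n_s), np.zeros(n_i)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
        grad = R - p                       # d(log-likelihood)/d(theta_i - b_j)
        theta += lr * grad.sum(axis=1) / n_i
        b -= lr * grad.sum(axis=0) / n_s
        theta -= theta.mean()              # pin the location of the ability scale
    return theta, b

_, est_b = fit_rasch(responses)

# In the paper's setting, est_b would be compared against NAEP item
# statistics; here the toy latent difficulties serve as the reference.
r, _ = pearsonr(est_b, true_b)
print(f"correlation with reference difficulties: {r:.2f}")
```

A larger simulated classroom (more rows in the response matrix) generally stabilizes the estimated difficulties, which is the computation-versus-accuracy tradeoff the abstract refers to.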
Related papers
- PS$^2$: Parameterized Control for Fine-Grained Student Proficiency Simulation [37.112666030892115]
Student Simulation (PS$^2$) is an unsupervised and parameterized model-level framework that simulates students with different proficiencies. PS$^2$ achieves finer-grained and consistent proficiency simulation compared to existing baselines.
arXiv Detail & Related papers (2026-01-31T18:27:56Z)
- Estimating problem difficulty without ground truth using Large Language Model comparisons [4.599673637363014]
We propose a new method for estimating problem difficulty, LLM compare. An LLM performs pairwise difficulty comparisons, and then Bradley-Terry scores are computed based on the outcomes. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation. (A minimal sketch of Bradley-Terry scoring from such comparisons appears after this list.)
arXiv Detail & Related papers (2025-12-16T09:13:56Z)
- SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors [58.87134689752605]
We introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. We show that even the best LLMs today have limited simulation ability (score: 40.80/100), and that performance scales log-linearly with model size. We demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning.
arXiv Detail & Related papers (2025-10-20T13:14:38Z)
- Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension? [8.558834738072363]
Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs). We collect a dataset of 489 items from the National Assessment of Educational Progress (NAEP) covering mathematics and reading comprehension in grades 4, 8, and 12. We apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations.
arXiv Detail & Related papers (2025-07-11T00:36:57Z)
- SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction [38.7828715471869]
We present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability. We show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
arXiv Detail & Related papers (2025-07-07T15:41:38Z)
- Estimating Item Difficulty Using Large Language Models and Tree-Based Machine Learning Algorithms [0.0]
Estimating item difficulty through field-testing is often resource-intensive and time-consuming. The present research examines the feasibility of using Large Language Models (LLMs) to predict item difficulty for K-5 mathematics and reading assessment items.
arXiv Detail & Related papers (2025-04-09T00:04:07Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based search method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all models on the perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)
- Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
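For the pairwise-comparison approach summarized above ("Estimating problem difficulty without ground truth using Large Language Model comparisons"), the sketch below shows how Bradley-Terry scores could be computed once comparison outcomes are available. It is an illustrative assumption, not that paper's code: the win counts are synthetic stand-ins for LLM judgments of which of two items is harder, and the update rule is the standard minorization-maximization iteration for the Bradley-Terry model.

```python
# Hypothetical sketch: Bradley-Terry difficulty scores from pairwise
# "which item is harder?" outcomes. The win counts below are synthetic.
import numpy as np

# wins[i, j] = number of comparisons in which item i was judged harder than item j
wins = np.array([
    [0, 3, 4, 5, 5],
    [2, 0, 3, 4, 5],
    [1, 2, 0, 3, 4],
    [0, 1, 2, 0, 3],
    [0, 0, 1, 2, 0],
], dtype=float)

def bradley_terry(wins, n_iter=200):
    """Minorization-maximization updates for Bradley-Terry strengths."""
    n = wins.shape[0]
    strength = np.ones(n)
    totals = wins + wins.T                 # comparisons per item pair
    for _ in range(n_iter):
        for i in range(n):
            denom = sum(totals[i, j] / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            strength[i] = wins[i].sum() / denom
        strength /= strength.sum()         # fix the scale (identifiability)
    return strength

scores = bradley_terry(wins)
print("difficulty scores:", np.round(scores, 3))
print("hardest to easiest:", np.argsort(-scores))
```

Items that consistently win "harder than" comparisons receive higher scores, yielding a difficulty ordering without any ground-truth labels.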