NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty
- URL: http://arxiv.org/abs/2508.03294v1
- Date: Tue, 05 Aug 2025 10:12:38 GMT
- Title: NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty
- Authors: Leonidas Zotos, Ivo Pascal de Jong, Matias Valdenegro-Toro, Andreea Ioana Sburlea, Malvina Nissim, Hedderik van Rijn
- Abstract summary: We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions. We obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples.
- Score: 15.12489035385276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.
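The supervised setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a single hypothetical uncertainty feature per question, a plain least-squares line as the model, and made-up training values. The names `fit_line` and `predict` are illustrative only.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(model, x):
    """Predict the fraction of students answering correctly."""
    intercept, slope = model
    # Clip to [0, 1] because the target is a proportion.
    return min(1.0, max(0.0, intercept + slope * x))


# Hypothetical training data (invented for illustration, not from the paper):
# x = LLM uncertainty when solving the question,
# y = observed fraction of students who answered correctly.
train_uncertainty = [0.05, 0.10, 0.20, 0.35, 0.50]
train_pct_correct = [0.95, 0.90, 0.80, 0.65, 0.50]

model = fit_line(train_uncertainty, train_pct_correct)
estimate = predict(model, 0.30)  # estimated difficulty for a new question
```

A small model like this is one plausible choice when only a few dozen labelled questions are available, since it has few parameters to fit.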
Related papers
- Enhancing Student Learning with LLM-Generated Retrieval Practice Questions: An Empirical Study in Data Science Courses [0.0]
Large Language Models (LLMs) can generate retrieval practice questions in response to prompts. Students exposed to LLM-generated retrieval practice achieved significantly higher knowledge retention, with an average accuracy of 89%. These findings suggest that LLM-generated retrieval questions can effectively support student learning and may provide a scalable solution for integrating retrieval practice into real-time teaching.
arXiv Detail & Related papers (2025-07-08T03:23:19Z)
- LLMs in the Classroom: Outcomes and Perceptions of Questions Written with the Aid of AI [0.0]
Students were unable to perceive whether questions were written with or without the aid of ChatGPT. Student scores on LLM-authored questions were almost 9% lower.
arXiv Detail & Related papers (2025-03-23T22:01:49Z)
- DAST: Difficulty-Aware Self-Training on Large Language Models [68.30467836807362]
Large Language Models (LLM) self-training methods always under-sample on challenging queries. This work proposes a difficulty-aware self-training framework that focuses on improving the quantity and quality of self-generated responses.
arXiv Detail & Related papers (2025-03-12T03:36:45Z)
- The Potential of Answer Classes in Large-scale Written Computer-Science Exams -- Vol. 2 [0.0]
In teacher training for secondary education, assessment guidelines are mandatory for every exam. We apply this concept to a university exam with 462 students and 41 tasks. For each task, instructors developed answer classes -- classes of expected responses.
arXiv Detail & Related papers (2024-12-12T10:20:39Z)
- Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all.
LLMs struggle to precisely detect students' errors and tailor their feedback to these errors.
Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z)
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [61.51687164769517]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge. Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills. We introduce LOVA3, an innovative framework named "Learning tO Visual question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning [4.376598435975689]
We discuss the challenges associated with employing large language models to enhance students' mathematical problem-solving skills.
LLMs can generate the wrong reasoning processes, and also exhibit difficulty in understanding the given questions' rationales when attempting to correct students' answers.
arXiv Detail & Related papers (2023-10-20T16:05:35Z)
- Novice Learner and Expert Tutor: Evaluating Math Reasoning Abilities of Large Language Models with Misconceptions [28.759189115877028]
We propose novel evaluations for mathematical reasoning capabilities of Large Language Models (LLMs) based on mathematical misconceptions.
Our primary approach is to simulate LLMs as a novice learner and an expert tutor, aiming to identify the incorrect answer to a math question that results from a specific misconception.
arXiv Detail & Related papers (2023-10-03T21:19:50Z)
- MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [74.73881579517055]
We propose a framework to generate such dialogues by pairing human teachers with a Large Language Model prompted to represent common student errors.
We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues.
arXiv Detail & Related papers (2023-05-23T21:44:56Z)
- Neural Multi-Task Learning for Teacher Question Detection in Online Classrooms [50.19997675066203]
We build an end-to-end neural framework that automatically detects questions from teachers' audio recordings.
By incorporating multi-task learning techniques, we are able to strengthen the understanding of semantic relations among different types of questions.
arXiv Detail & Related papers (2020-05-16T02:17:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.