Reasoning and Sampling-Augmented MCQ Difficulty Prediction via LLMs
- URL: http://arxiv.org/abs/2503.08551v1
- Date: Tue, 11 Mar 2025 15:39:43 GMT
- Title: Reasoning and Sampling-Augmented MCQ Difficulty Prediction via LLMs
- Authors: Wanyong Feng, Peter Tran, Stephen Sireci, Andrew Lan
- Abstract summary: We propose a novel, two-stage method to predict the difficulty of multiple-choice questions (MCQs). First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option. Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ.
- Score: 1.749935196721634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The difficulty of multiple-choice questions (MCQs) is a crucial factor for educational assessments. Predicting MCQ difficulty is challenging since it requires understanding both the complexity of reaching the correct option and the plausibility of distractors, i.e., incorrect options. In this paper, we propose a novel, two-stage method to predict the difficulty of MCQs. First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option. We use not just the MCQ itself but also these reasoning steps as input to predict the difficulty. Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ. This setup, inspired by item response theory (IRT), enables us to estimate the likelihood of students selecting each (both correct and incorrect) option. We align these predictions with their ground-truth values using a Kullback-Leibler (KL) divergence-based regularization objective, and use the estimated likelihoods to predict MCQ difficulty. We evaluate our method on two real-world math MCQ and response datasets with ground-truth difficulty values estimated using IRT. Experimental results show that our method outperforms all baselines, with up to a 28.3% reduction in mean squared error and a 34.6% improvement in the coefficient of determination. We also qualitatively discuss how our novel method results in higher accuracy in predicting MCQ difficulty.
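To make the IRT-inspired second stage more concrete, the sketch below shows one way the knowledge-level sampling and KL-divergence alignment could be wired up. This is a minimal illustration under stated assumptions, not the authors' released implementation: the choice of a standard-normal distribution for knowledge levels, the softmax option scorer, the module and parameter names, and the loss weighting are all illustrative.

```python
# Minimal sketch (assumptions, not the paper's exact code) of the IRT-inspired
# stage: sample student knowledge levels, score every option (correct and
# distractors), and align predicted option-selection probabilities with the
# observed response frequencies via a KL-divergence regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class OptionLikelihoodModel(nn.Module):
    def __init__(self, option_dim: int, n_samples: int = 32):
        super().__init__()
        self.n_samples = n_samples
        # Scores a pair (option embedding, sampled knowledge level). The option
        # embedding is assumed to encode the MCQ option plus its LLM-augmented
        # reasoning steps.
        self.scorer = nn.Sequential(
            nn.Linear(option_dim + 1, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, option_embs: torch.Tensor) -> torch.Tensor:
        # option_embs: (n_options, option_dim) for a single MCQ.
        n_opt = option_embs.size(0)
        # Sample knowledge levels theta ~ N(0, 1), a common IRT-style choice.
        theta = torch.randn(self.n_samples, 1)
        # Pair every sampled theta with every option, then score the pairs.
        theta_rep = theta.repeat_interleave(n_opt, dim=0)               # (S*O, 1)
        opts_rep = option_embs.repeat(self.n_samples, 1)                # (S*O, D)
        scores = self.scorer(torch.cat([opts_rep, theta_rep], dim=-1))  # (S*O, 1)
        scores = scores.view(self.n_samples, n_opt)
        # Per-student option-selection probabilities, marginalized over students.
        return F.softmax(scores, dim=-1).mean(dim=0)  # (n_options,)


def training_loss(pred_probs, observed_probs, pred_difficulty, true_difficulty,
                  kl_weight: float = 0.1):
    # Difficulty regression loss plus a KL term that aligns the predicted
    # option-selection probabilities with their ground-truth frequencies.
    # pred_difficulty would come from a small regression head over the
    # estimated likelihoods (not shown here).
    mse = F.mse_loss(pred_difficulty, true_difficulty)
    kl = F.kl_div(pred_probs.clamp_min(1e-8).log(), observed_probs,
                  reduction="sum")
    return mse + kl_weight * kl


# Usage sketch: four option embeddings from some encoder (random placeholders).
model = OptionLikelihoodModel(option_dim=768)
probs = model(torch.randn(4, 768))  # expected selection probability per option
```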
Related papers
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA).
In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z) - Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice.
A popular attempt to lower the cost is to compute the average score on a subset of the benchmark.
This approach often yields an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the benchmark subset.
We train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost.
arXiv Detail & Related papers (2025-03-17T16:15:02Z) - FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluating large language models' reasoning capabilities. We introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move. We show that models trained on our state checking and transition data demonstrate gains in math reasoning of up to 5.1% on GSM8K.
arXiv Detail & Related papers (2025-02-27T16:23:25Z) - CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought [10.166370877826486]
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses.
Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, which incurs high computational costs.
We propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process.
arXiv Detail & Related papers (2025-02-24T14:48:06Z) - Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation [12.638577140117702]
We show that uncertainty features contribute substantially to difficulty prediction, where difficulty is inversely proportional to the number of students who can correctly answer a question. In addition to showing the value of our approach, we also observe that our model achieves state-of-the-art results on the publicly available BEA dataset.
arXiv Detail & Related papers (2024-12-16T14:55:09Z) - Question Difficulty Ranking for Multiple-Choice Reading Comprehension [3.273958158967657]
Multiple-choice (MC) tests are an efficient method to assess English learners.
It is useful for test creators to rank candidate MC questions by difficulty during exam curation.
We explore automated approaches to rank MC questions by difficulty.
arXiv Detail & Related papers (2024-04-16T16:23:10Z) - Can multiple-choice questions really be useful in detecting the abilities of LLMs? [15.756543037102256]
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs).
The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQs' efficacy.
We evaluate nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English.
arXiv Detail & Related papers (2024-03-26T14:43:48Z) - DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs).
We first categorize questions into two subsets based on a confidence score ($\mathcal{CS}$), which is estimated from the statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z) - Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering [63.12469700986452]
We introduce the concept of uncertainty-aware curriculum learning (CL).
Here, uncertainty serves as the guiding principle for dynamically adjusting the difficulty.
In practice, we seamlessly integrate the VideoQA model into our framework and conduct comprehensive experiments.
arXiv Detail & Related papers (2024-01-03T02:29:34Z) - T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Mixed Large Language Model Signals for Science Question Answering [59.63860993280275]
Large Language Models (LLMs) have demonstrated exceptional performance in various Natural Language Processing (NLP) tasks.
We propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals.
Our approach achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%.
arXiv Detail & Related papers (2023-05-05T11:56:30Z) - Momentum Contrastive Pre-training for Question Answering [54.57078061878619]
MCROSS introduces a momentum contrastive learning framework to align the answer probability between cloze-like and natural query-passage sample pairs.
Our method achieves noticeable improvement compared with all baselines in both supervised and zero-shot scenarios.
arXiv Detail & Related papers (2022-12-12T08:28:22Z)