Related papers: Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?

URL: http://arxiv.org/abs/2407.05327v1
Date: Sun, 7 Jul 2024 10:48:04 GMT
Title: Can Model Uncertainty Function as a Proxy for Multiple-Choice Question Item Difficulty?
Authors: Leonidas Zotos, Hedderik van Rijn, Malvina Nissim
Abstract summary: We leverage an aspect of generative large models which might be seen as a weakness when answering questions, namely their uncertainty. We explore correlations between two different metrics of uncertainty and the actual student response distribution.
Score: 12.638577140117702
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Estimating the difficulty of multiple-choice questions would be of great help for educators who must spend substantial time creating and piloting stimuli for their tests, and for learners who want to practice. Supervised approaches to difficulty estimation have to date yielded mixed results. In this contribution we leverage an aspect of generative large models which might be seen as a weakness when answering questions, namely their uncertainty, and exploit it towards exploring correlations between two different metrics of uncertainty and the actual student response distribution. While we observe some present but weak correlations, we also discover that the models' behaviour is different in the case of correct vs wrong answers, and that correlations differ substantially according to the different question types which are included in our fine-grained, previously unused dataset of 451 questions from a Biopsychology course. In discussing our findings, we also suggest potential avenues to further leverage model uncertainty as an additional proxy for item difficulty.
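
Illustrative sketch (not from the paper): the abstract does not name its two uncertainty metrics, so the example below assumes one plausible choice, the Shannon entropy of a model's probabilities over the answer options, and correlates it with item difficulty taken as one minus the proportion of students answering correctly. All data values and the function names `entropy` and `pearson` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a distribution over answer options."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Hypothetical model probabilities over four options, one row per question.
model_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
# Hypothetical proportion of students who answered each question correctly.
student_correct = [0.95, 0.80, 0.55, 0.30]

# Assumed uncertainty metric: entropy of the model's option distribution.
uncertainty = [entropy(p) for p in model_probs]
# Assumed difficulty measure: share of students who answered incorrectly.
difficulty = [1.0 - s for s in student_correct]

r = pearson(uncertainty, difficulty)
print(f"Pearson r between model uncertainty and item difficulty: {r:.3f}")
```

On real data the abstract reports only weak correlations that vary by question type, so a per-type breakdown would be a natural next step for a sketch like this.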

Related papers

Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation [12.638577140117702]
We show that uncertainty features contribute substantially to difficulty prediction, where difficulty is inversely proportional to the number of students who can correctly answer a question. In addition to showing the value of our approach, we also observe that our model achieves state-of-the-art results on the USMLE and CMCQRD publicly available datasets.
arXiv Detail & Related papers (2024-12-16T14:55:09Z)
DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction [53.803276766404494]
Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. We propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction. Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.
arXiv Detail & Related papers (2024-12-12T18:52:40Z)
Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning [0.0]
Large Language Models (LLMs) have gained significant popularity in recent years for their ability to answer questions in various fields. We introduce an analysis for evaluating the performance of popular open-source LLMs. We focus on the relationship between answer accuracy and variability in topics related to physics.
arXiv Detail & Related papers (2024-11-18T13:42:13Z)
A dataset of questions on decision-theoretic reasoning in Newcomb-like problems [10.826981264871655]
We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Some ways of reasoning about Newcomb-like problems may allow for greater cooperation between models.
arXiv Detail & Related papers (2024-11-15T21:19:04Z)
Certainly Uncertain: A Benchmark and Metric for Multimodal Epistemic and Aleatoric Awareness [106.52630978891054]
We present a taxonomy of uncertainty specific to vision-language AI systems. We also introduce a new metric, confidence-weighted accuracy, that is well correlated with both accuracy and calibration error.
arXiv Detail & Related papers (2024-07-02T04:23:54Z)
Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination". We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z)
A Tale Of Two Long Tails [4.970364068620608]
We identify examples the model is uncertain about and characterize the source of said uncertainty. We investigate whether the rate of learning in the presence of additional information differs between atypical and noisy examples. Our results show that well-designed interventions over the course of training can be an effective way to characterize and distinguish between different sources of uncertainty.
arXiv Detail & Related papers (2021-07-27T22:49:59Z)
A New Score for Adaptive Tests in Bayesian and Credal Networks [64.80185026979883]
A test is adaptive when its sequence and number of questions are dynamically tuned on the basis of the estimated skills of the taker. We present an alternative family of scores, based on the mode of the posterior probabilities, and hence easier to explain.
arXiv Detail & Related papers (2021-05-25T20:35:42Z)
Exploring Bayesian Deep Learning for Urgent Instructor Intervention Need in MOOC Forums [58.221459787471254]
Massive Open Online Courses (MOOCs) have become a popular choice for e-learning thanks to their great flexibility. Due to large numbers of learners and their diverse backgrounds, it is taxing to offer real-time support. With the large volume of posts and high workloads for MOOC instructors, it is unlikely that the instructors can identify all learners requiring intervention. This paper explores for the first time Bayesian deep learning on learner-based text posts with two methods: Monte Carlo Dropout and Variational Inference.
arXiv Detail & Related papers (2021-04-26T15:12:13Z)
Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering. Our proposed generative passage selection model performs better (4.9% higher than the baseline) on an adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)
Learning with Instance Bundles for Reading Comprehension [61.823444215188296]
We introduce new supervision techniques that compare question-answer scores across multiple related instances. Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers. We empirically demonstrate the effectiveness of training with instance bundles on two datasets.
arXiv Detail & Related papers (2021-04-18T06:17:54Z)
R2DE: a NLP approach to estimating IRT parameters of newly generated questions [3.364554138758565]
R2DE is a model capable of assessing newly generated multiple-choice questions by looking at the text of the question. In particular, it can estimate the difficulty and the discrimination of each question.
arXiv Detail & Related papers (2020-01-21T14:31:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.