Question Difficulty Ranking for Multiple-Choice Reading Comprehension
- URL: http://arxiv.org/abs/2404.10704v1
- Date: Tue, 16 Apr 2024 16:23:10 GMT
- Title: Question Difficulty Ranking for Multiple-Choice Reading Comprehension
- Authors: Vatsal Raina, Mark Gales
- Abstract summary: Multiple-choice (MC) tests are an efficient method to assess English learners.
It is useful for test creators to rank candidate MC questions by difficulty during exam curation.
We explore automated approaches to rank MC questions by difficulty.
- Score: 3.273958158967657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple-choice (MC) tests are an efficient method to assess English learners. It is useful for test creators to rank candidate MC questions by difficulty during exam curation. Typically, the difficulty is determined by having human test takers trial the questions in a pretesting stage. However, this is expensive and not scalable. Therefore, we explore automated approaches to rank MC questions by difficulty. However, there is limited data for explicit training of a system for difficulty scores. Hence, we compare task transfer and zero-shot approaches: task transfer adapts level classification and reading comprehension systems for difficulty ranking, while zero-shot prompting of instruction-finetuned language models contrasts absolute assessment against comparative assessment. It is found that level classification transfers better than reading comprehension. Additionally, zero-shot comparative assessment is more effective at difficulty ranking than both absolute assessment and the task transfer approaches, with a Spearman's correlation of 40.4%. Combining the systems is observed to further boost the correlation.
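As an illustration of the zero-shot comparative setup described in the abstract, the sketch below asks a model which of two questions is harder, aggregates pairwise wins into a ranking, and scores that ranking against pretest difficulties with Spearman's correlation. The `ask_llm_which_is_harder` helper, its length-based placeholder logic, and the win-rate aggregation are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations
from scipy.stats import spearmanr

def ask_llm_which_is_harder(question_a: str, question_b: str) -> int:
    """Hypothetical LLM call: return 0 if A is judged harder, 1 if B is.

    Stand-in for zero-shot prompting of an instruction-finetuned model;
    here we simply compare lengths so the sketch runs end to end.
    """
    return 0 if len(question_a) >= len(question_b) else 1

def comparative_difficulty_ranking(questions: list[str]) -> list[float]:
    """Aggregate pairwise 'which is harder' judgements into per-question scores."""
    wins = [0] * len(questions)
    for i, j in combinations(range(len(questions)), 2):
        harder = ask_llm_which_is_harder(questions[i], questions[j])
        wins[i if harder == 0 else j] += 1
    return [w / (len(questions) - 1) for w in wins]  # normalised win rate

# Evaluate against (hypothetical) pretest difficulties with Spearman's rank correlation.
questions = ["Q1 ...", "Q2 which is somewhat longer ...", "Q3 the longest candidate question ..."]
pretest_difficulty = [0.2, 0.5, 0.9]
predicted = comparative_difficulty_ranking(questions)
rho, _ = spearmanr(predicted, pretest_difficulty)
print(f"Spearman's correlation: {rho:.3f}")
```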
Related papers
- Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.
We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)).
We find that progression from the Easy to the Medium tier requires adopting an R1 reasoning style with minimal SFT (around 1K instances).
Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z)
- Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering [68.3400058037817]
We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality.
We show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations.
arXiv Detail & Related papers (2025-04-10T09:24:54Z)
- Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice.
A popular attempt to lower the cost is to compute the average score on a subset of the benchmark.
This approach often renders an unreliable measure of LM performance because the average score is often confounded with the difficulty of the questions in the benchmark subset.
We train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost.
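A minimal sketch of predicting difficulty from question content alone, assuming TF-IDF features and ridge regression as a stand-in for whatever model the paper actually trains (the data and feature choices here are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy training data: question text paired with an observed difficulty
# (e.g. 1 - fraction of models answering correctly). Values are made up.
train_questions = [
    "What is 2 + 2?",
    "Explain the proof of the Banach fixed-point theorem.",
    "Name the capital of France.",
    "Derive the gradient of the softmax cross-entropy loss.",
]
train_difficulty = [0.05, 0.85, 0.10, 0.70]

# TF-IDF + ridge regression as a stand-in for a learned difficulty predictor.
difficulty_model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
difficulty_model.fit(train_questions, train_difficulty)

# Predicted difficulties can then be used to de-confound average scores
# computed on a small benchmark subset.
print(difficulty_model.predict(["Prove that the square root of 2 is irrational."]))
```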
arXiv Detail & Related papers (2025-03-17T16:15:02Z)
- Reasoning and Sampling-Augmented MCQ Difficulty Prediction via LLMs [1.749935196721634]
We propose a novel, two-stage method to predict the difficulty of multiple-choice questions (MCQs).
First, to better estimate the complexity of each MCQ, we use large language models (LLMs) to augment the reasoning steps required to reach each option.
Second, to capture the plausibility of distractors, we sample knowledge levels from a distribution to account for variation among students responding to the MCQ.
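A heavily simplified sketch of such a two-stage pipeline, under assumptions: `count_reasoning_steps` stands in for LLM-augmented reasoning, and a Beta distribution stands in for the sampled knowledge levels; neither is the paper's actual procedure.

```python
import numpy as np

def count_reasoning_steps(question: str, option: str) -> int:
    """Hypothetical stand-in for stage 1: an LLM would be prompted to spell out
    the reasoning needed to accept or reject this option; here we fake a count."""
    return 1 + (len(option) % 4)

def predicted_difficulty(question: str, options: list[str], correct_idx: int,
                         n_students: int = 10_000, seed: int = 0) -> float:
    """Stage 2 sketch: sample student knowledge levels from a distribution and
    estimate the fraction expected to miss the question (used as difficulty)."""
    rng = np.random.default_rng(seed)
    steps_to_answer = count_reasoning_steps(question, options[correct_idx])
    knowledge = rng.beta(2.0, 2.0, size=n_students)   # per-student knowledge level
    p_solve = knowledge ** steps_to_answer            # longer reasoning chains demand more knowledge
    solved = rng.random(n_students) < p_solve
    return 1.0 - solved.mean()                        # expected error rate

print(predicted_difficulty("Which value of x solves x^2 = 9, x < 0?",
                           ["3", "-3", "9", "-9"], correct_idx=1))
```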
arXiv Detail & Related papers (2025-03-11T15:39:43Z)
- BiRating -- Iterative averaging on a bipartite graph of Beat Saber scores, player skills, and map difficulties [0.0]
Difficulty estimation of Beat Saber maps is an interesting data analysis problem and valuable to the Beat Saber competitive scene.
We present a simple algorithm that iteratively averages player skill and map difficulty estimations in a bipartite graph of players and maps, connected by scores, using scores only as input.
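A toy version of iterative averaging on such a bipartite graph is sketched below; the additive skill/difficulty decomposition is an assumption for illustration, since the abstract does not spell out the exact update rule.

```python
from collections import defaultdict

# scores[(player, map)] = observed score; the only input the method needs.
scores = {
    ("alice", "map1"): 0.92, ("alice", "map2"): 0.75,
    ("bob",   "map1"): 0.80, ("bob",   "map3"): 0.60,
    ("carol", "map2"): 0.70, ("carol", "map3"): 0.55,
}

skill = defaultdict(float)       # estimated player skill
difficulty = defaultdict(float)  # estimated map difficulty

for _ in range(100):  # iterate the two averaging steps until estimates stabilise
    by_player, by_map = defaultdict(list), defaultdict(list)
    for (p, m), s in scores.items():
        by_player[p].append(s + difficulty[m])   # skill ~ score achieved plus map difficulty
        by_map[m].append(skill[p] - s)           # difficulty ~ player skill minus score achieved
    skill = {p: sum(v) / len(v) for p, v in by_player.items()}
    difficulty = {m: sum(v) / len(v) for m, v in by_map.items()}

print(sorted(difficulty.items(), key=lambda kv: kv[1], reverse=True))
```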
arXiv Detail & Related papers (2025-02-27T04:07:53Z)
- Are You Doubtful? Oh, It Might Be Difficult Then! Exploring the Use of Model Uncertainty for Question Difficulty Estimation [12.638577140117702]
We show that uncertainty features contribute substantially to difficulty prediction, where difficulty is inversely proportional to the number of students who can correctly answer a question.
In addition to showing the value of our approach, we also observe that our model achieves state-of-the-art results on the USMLE and CMCQRD publicly available datasets.
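One typical uncertainty feature of this kind is the entropy of a QA model's distribution over the answer options; the sketch below is illustrative and not the paper's exact feature set or model.

```python
import numpy as np

def answer_entropy(option_probs: np.ndarray) -> float:
    """Shannon entropy of a model's distribution over the answer options.
    Higher entropy (more model uncertainty) is taken as a signal that the
    question is harder, i.e. fewer students answer it correctly."""
    p = option_probs / option_probs.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical softmax outputs of an MC question-answering model.
easy_question = np.array([0.94, 0.02, 0.02, 0.02])   # confident -> likely easy
hard_question = np.array([0.30, 0.28, 0.22, 0.20])   # uncertain -> likely hard
print(answer_entropy(easy_question), answer_entropy(hard_question))

# Such entropies (plus other uncertainty features) would be fed to a
# regressor that predicts the observed difficulty of each question.
```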
arXiv Detail & Related papers (2024-12-16T14:55:09Z)
- Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks? [74.88417042125985]
We investigate various data-driven strategies that offer supervision data at different quality levels for tasks of varying complexity.
We find that even when the outcome error rate for hard task supervision is high, training on such data can outperform perfectly correct supervision on easier subtasks.
Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield notable performance improvements.
arXiv Detail & Related papers (2024-10-27T17:55:27Z)
- Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones? [65.43882564649721]
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues.
We develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty.
We analyze the potential for improvement in consistency via a relative consistency score.
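Read loosely, consistency here can be measured on easy/hard question pairs; the sketch below uses P(easy solved | hard solved) as an illustrative proxy, which is not necessarily the benchmark's exact relative consistency score.

```python
def consistency_score(results: list[tuple[bool, bool]]) -> float:
    """results[i] = (solved_easy, solved_hard) for one easy/hard question pair.
    Returns P(solved_easy | solved_hard): an illustrative consistency measure."""
    easy_given_hard = [easy for easy, hard in results if hard]
    return sum(easy_given_hard) / len(easy_given_hard) if easy_given_hard else float("nan")

# Hypothetical model outcomes on four pairs with a strict order of difficulty.
outcomes = [(True, True), (False, True), (True, False), (True, True)]
print(f"consistency: {consistency_score(outcomes):.2f}")  # 2/3 here
```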
arXiv Detail & Related papers (2024-06-18T17:25:47Z)
- Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment [0.6138671548064356]
We propose training pre-trained language models (PLMs) as surrogate models to enable item response theory (IRT) assessment.
We also propose two strategies to control the difficulty levels of both the gaps and the distractors using ranking rules to reduce invalid distractors.
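For background, IRT assessment scores items through an item characteristic curve; under the standard two-parameter logistic (2PL) model, the probability that a test taker of ability theta answers item i correctly is

```latex
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
```

where b_i is the item difficulty and a_i its discrimination. The surrogate PLMs described above stand in for human test takers so that such item parameters can be fitted without pretesting; note the 2PL form is standard background, and the paper may use a different IRT variant.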
arXiv Detail & Related papers (2024-03-03T09:18:05Z)
- Assessing Distractors in Multiple-Choice Tests [10.179963650540056]
We propose metrics for the quality of distractors in multiple-choice reading comprehension tests.
Specifically, we define quality in terms of the incorrectness, plausibility and diversity of the distractor options.
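A minimal sketch of what such metrics might look like, assuming a comprehension model's per-option probabilities for incorrectness/plausibility and option embeddings for diversity; these proxies are assumptions, not the paper's definitions.

```python
import numpy as np

def distractor_metrics(option_probs: np.ndarray, option_embs: np.ndarray,
                       correct_idx: int) -> dict:
    """Illustrative proxies for distractor quality.

    option_probs: a comprehension model's probability for each option.
    option_embs:  an embedding vector per option (used for diversity).
    """
    distractors = [i for i in range(len(option_probs)) if i != correct_idx]
    # Incorrectness: distractors should be clearly less supported than the key.
    incorrectness = float(option_probs[correct_idx] - option_probs[distractors].max())
    # Plausibility: distractors should still attract some probability mass.
    plausibility = float(option_probs[distractors].mean())
    # Diversity: average pairwise cosine distance between distractor embeddings.
    dists = []
    for a in distractors:
        for b in distractors:
            if a < b:
                cos = option_embs[a] @ option_embs[b] / (
                    np.linalg.norm(option_embs[a]) * np.linalg.norm(option_embs[b]))
                dists.append(1.0 - cos)
    diversity = float(np.mean(dists)) if dists else 0.0
    return {"incorrectness": incorrectness, "plausibility": plausibility,
            "diversity": diversity}

probs = np.array([0.70, 0.15, 0.10, 0.05])            # hypothetical model probabilities
embs = np.random.default_rng(0).normal(size=(4, 8))   # hypothetical option embeddings
print(distractor_metrics(probs, embs, correct_idx=0))
```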
arXiv Detail & Related papers (2023-11-08T09:37:09Z)
- Analyzing Multiple-Choice Reading and Listening Comprehension Tests [0.0]
This work investigates how much of a contextual passage needs to be read in multiple-choice reading comprehension tests (based on conversation transcriptions) and listening comprehension tests in order to work out the correct answer.
We find that automated reading comprehension systems can perform significantly better than random with partial or even no access to the context passage.
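The evaluation protocol behind this finding can be sketched as follows; `qa_model` is a hypothetical comprehension system, and the dummy model and example are placeholders so the sketch runs.

```python
def evaluate_with_partial_context(qa_model, examples, fraction: float) -> float:
    """Accuracy of an MC comprehension system when only the first `fraction`
    of each context passage is visible (0.0 means no passage at all)."""
    correct = 0
    for passage, question, options, answer_idx in examples:
        visible = passage[: int(len(passage) * fraction)]
        pred = qa_model(visible, question, options)   # hypothetical model call
        correct += int(pred == answer_idx)
    return correct / len(examples)

# Random guessing on 4-option questions gives 0.25; the finding above is that
# system accuracies at fractions 0.0 or 0.25 can sit well above this baseline.
def dummy_model(passage, question, options):
    return 0  # placeholder so the sketch runs without a real system

examples = [("Some passage text ...", "What is discussed?", ["A", "B", "C", "D"], 0)]
for frac in (0.0, 0.25, 0.5, 1.0):
    print(frac, evaluate_with_partial_context(dummy_model, examples, frac))
```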
arXiv Detail & Related papers (2023-07-03T14:55:02Z)
- RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank [54.854714257687334]
We propose a novel approach, RankCSE, for unsupervised sentence representation learning.
It incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework.
An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks.
arXiv Detail & Related papers (2023-05-26T08:27:07Z)
- Integrating Rankings into Quantized Scores in Peer Review [61.27794774537103]
In peer review, reviewers are usually asked to provide scores for the papers.
To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed.
There is no standard procedure for using this ranking information, and Area Chairs may use it in different ways.
We take a principled approach to integrate the ranking information into the scores.
arXiv Detail & Related papers (2022-04-05T19:39:13Z)
- Difficulty-Aware Machine Translation Evaluation [19.973201669851626]
We propose a novel difficulty-aware machine translation evaluation metric.
A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function.
Our proposed method performs well even when all the MT systems are very competitive.
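The weighting idea can be sketched as below, assuming per-sentence difficulty is defined as one minus the mean system score, which may differ from the paper's exact formulation.

```python
import numpy as np

def difficulty_aware_score(sentence_scores: np.ndarray) -> np.ndarray:
    """sentence_scores[i, j]: quality score of MT system j on sentence i (in [0, 1]).

    Sentences that most systems translate poorly are treated as difficult and
    receive larger weights, so competitive systems are separated by hard cases.
    """
    difficulty = 1.0 - sentence_scores.mean(axis=1)   # per-sentence difficulty
    weights = difficulty / difficulty.sum()            # normalise to sum to 1
    return weights @ sentence_scores                   # difficulty-weighted system scores

scores = np.array([[0.95, 0.96, 0.94],    # easy sentence: every system does well
                   [0.40, 0.70, 0.45]])   # hard sentence: only system 2 copes
print(difficulty_aware_score(scores))      # system 2 is rewarded for the hard case
```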
arXiv Detail & Related papers (2021-07-30T02:45:36Z)
- Deep learning for sentence clustering in essay grading support [1.7259867886009057]
We introduce two datasets of undergraduate student essays in Finnish, manually annotated for salient arguments on the sentence level.
We evaluate several deep-learning embedding methods for their suitability to sentence clustering in support of essay grading.
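A minimal version of the clustering pipeline being evaluated, with TF-IDF and k-means standing in for the deep-learning embedding methods compared in the paper; the sentences and cluster count are invented for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Sentences from hypothetical student essays; the real data would be the
# Finnish undergraduate essays annotated for salient arguments.
sentences = [
    "The experiment shows that reaction time increases with temperature.",
    "Reaction speed grows when the solution is heated.",
    "The report was submitted two days late.",
    "Higher temperatures clearly accelerate the reaction.",
]

# Any sentence-embedding method can be plugged in here; TF-IDF keeps the
# sketch dependency-light, whereas the paper compares deep-learning embeddings.
embeddings = TfidfVectorizer().fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for sent, lab in zip(sentences, labels):
    print(lab, sent)
```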
arXiv Detail & Related papers (2021-04-23T12:32:51Z)
- PiRank: Learning To Rank via Differentiable Sorting [85.28916333414145]
We propose PiRank, a new class of differentiable surrogates for ranking.
We show that PiRank exactly recovers the desired metrics in the limit of zero temperature.
arXiv Detail & Related papers (2020-12-12T05:07:36Z)
- The World is Not Binary: Learning to Rank with Grayscale Data for Dialogue Response Selection [55.390442067381755]
We show that grayscale data can be automatically constructed without human effort.
Our method employs off-the-shelf response retrieval models and response generation models as automatic grayscale data generators.
Experiments on three benchmark datasets and four state-of-the-art matching models show that the proposed approach brings significant and consistent performance improvements.
arXiv Detail & Related papers (2020-04-06T06:34:54Z)