Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment
- URL: http://arxiv.org/abs/2403.01456v1
- Date: Sun, 3 Mar 2024 09:18:05 GMT
- Title: Controlling Cloze-test Question Item Difficulty with PLM-based Surrogate Models for IRT Assessment
- Authors: Jingshen Zhang and Jiajun Xie and Xinying Qiu
- Abstract summary: We propose training pre-trained language models (PLMs) as surrogate models to enable item response theory (IRT) assessment.
We also propose two strategies to control the difficulty levels of both the gaps and the distractors using ranking rules to reduce invalid distractors.
- Score: 0.6138671548064356
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Item difficulty plays a crucial role in adaptive testing. However, few works
have focused on generating questions of varying difficulty levels, especially
for multiple-choice (MC) cloze tests. We propose training pre-trained language
models (PLMs) as surrogate models to enable item response theory (IRT)
assessment, avoiding the need for human test subjects. We also propose two
strategies to control the difficulty levels of both the gaps and the
distractors using ranking rules to reduce invalid distractors. Experimentation
on a benchmark dataset demonstrates that our proposed framework and methods can
effectively control and evaluate the difficulty levels of MC cloze tests.
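As a rough illustration of the surrogate-based IRT idea described above (and not the paper's actual pipeline), the sketch below fits a standard two-parameter logistic (2PL) IRT model to a binary response matrix by joint gradient-based maximum likelihood, where each row could come from a PLM "surrogate test taker" answering MC cloze items. The toy data, function names, and optimization details are all assumptions for demonstration.

```python
# Minimal 2PL IRT sketch (illustrative only, not the paper's implementation):
# P(correct) = sigmoid(a_j * (theta_i - b_j)), fitted by joint gradient ascent.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_2pl(responses, n_iters=3000, lr=0.5):
    """responses: (n_takers, n_items) 0/1 matrix of item correctness."""
    n_takers, n_items = responses.shape
    theta = np.zeros(n_takers)   # ability of each surrogate test taker
    b = np.zeros(n_items)        # item difficulty
    a = np.ones(n_items)         # item discrimination

    for _ in range(n_iters):
        p = sigmoid(a * (theta[:, None] - b))       # predicted P(correct)
        resid = responses - p
        # Gradients of the Bernoulli log-likelihood (averaged for stability).
        g_theta = (resid * a).mean(axis=1)
        g_b = -(resid * a).mean(axis=0)
        g_a = (resid * (theta[:, None] - b)).mean(axis=0)
        theta += lr * g_theta
        b += lr * g_b
        a += lr * g_a
        # Standardize ability to pin down the scale/location indeterminacy.
        theta = (theta - theta.mean()) / (theta.std() + 1e-8)
        a = np.clip(a, 0.2, 4.0)
    return theta, a, b

# Toy demo: 30 hypothetical PLM surrogates answering 20 cloze items.
true_theta = rng.normal(size=30)
true_b = rng.normal(size=20)
true_a = rng.uniform(0.5, 2.0, size=20)
responses = rng.binomial(1, sigmoid(true_a * (true_theta[:, None] - true_b)))

theta_hat, a_hat, b_hat = fit_2pl(responses)
print("estimated item difficulties:", np.round(b_hat, 2))
```

In such a setup, the fitted b values play the role of item difficulty, which is the kind of quantity the gap- and distractor-control strategies in the abstract aim to regulate.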
Related papers
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA).
In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice.
A popular attempt to lower the cost is to compute the average score on a subset of the benchmark.
This approach often yields an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the benchmark subset.
We train a model that predicts question difficulty from its content, enabling a reliable measurement at a fraction of the cost; a generic sketch of this idea appears after the list below.
arXiv Detail & Related papers (2025-03-17T16:15:02Z)
- DAST: Difficulty-Aware Self-Training on Large Language Models [68.30467836807362]
Self-training methods for Large Language Models (LLMs) tend to under-sample challenging queries.
This work proposes a difficulty-aware self-training framework that focuses on improving the quantity and quality of self-generated responses.
arXiv Detail & Related papers (2025-03-12T03:36:45Z)
- Loss-Aware Curriculum Learning for Chinese Grammatical Error Correction [21.82403446634522]
Chinese grammatical error correction (CGEC) aims to detect and correct errors in the input Chinese sentences.
Current approaches ignore that correction difficulty varies across instances and treat all samples equally.
We propose a multi-granularity Curriculum Learning framework to address this problem.
arXiv Detail & Related papers (2024-12-31T08:11:49Z)
- A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 unique instruction-following prompts.
With our synthetic prompts, we use two preference dataset curation methods: rejection sampling (RS) and Monte Carlo Tree Search (MCTS).
Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.
High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z)
- Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks? [74.88417042125985]
We investigate various data-driven strategies that offer supervision data at different quality levels for tasks of varying complexity.
We find that even when the outcome error rate for hard task supervision is high, training on such data can outperform perfectly correct supervision on easier subtasks.
Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision can yield notable performance improvements.
arXiv Detail & Related papers (2024-10-27T17:55:27Z)
- Step-by-Step Reasoning for Math Problems via Twisted Sequential Monte Carlo [55.452453947359736]
We introduce a novel verification method based on Twisted Sequential Monte Carlo (TSMC).
We apply TSMC to Large Language Models by estimating the expected future rewards at partial solutions.
This approach results in a more straightforward training target that eliminates the need for step-wise human annotations.
arXiv Detail & Related papers (2024-10-02T18:17:54Z)
- Hybrid Classification-Regression Adaptive Loss for Dense Object Detection [19.180514552400883]
We propose a Hybrid Classification-Regression Adaptive Loss, termed HCRAL.
We introduce the Residual of Classification and IoU (RCI) module for cross-task supervision, addressing task inconsistencies, and the Conditioning Factor (CF) to focus on difficult-to-train samples within each task.
We also introduce a new strategy named Expanded Adaptive Training Sample Selection (EATSS) to provide additional samples that exhibit classification and regression inconsistencies.
arXiv Detail & Related papers (2024-08-30T10:31:39Z)
- Active Testing of Large Language Model via Multi-Stage Sampling [17.89896012553348]
AcTracer is an active testing framework tailored for large language models (LLMs).
It strategically selects a small subset of test data to achieve a nearly optimal performance estimation.
Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods.
arXiv Detail & Related papers (2024-08-07T06:17:48Z)
- ActiveGLAE: A Benchmark for Deep Active Learning with Transformers [5.326702806697265]
Deep active learning (DAL) seeks to reduce annotation costs by enabling the model to actively query instance annotations from which it expects to learn the most.
There is currently no standardized evaluation protocol for transformer-based language models in the field of DAL.
We propose the ActiveGLAE benchmark, a comprehensive collection of data sets and evaluation guidelines for assessing DAL.
arXiv Detail & Related papers (2023-06-16T13:07:29Z)
- On Pitfalls of Test-Time Adaptation [82.8392232222119]
Test-Time Adaptation (TTA) has emerged as a promising approach for tackling the robustness challenge under distribution shifts.
We present TTAB, a test-time adaptation benchmark that encompasses ten state-of-the-art algorithms, a diverse array of distribution shifts, and two evaluation protocols.
arXiv Detail & Related papers (2023-06-06T09:35:29Z)
- Transfer and Active Learning for Dissonance Detection: Addressing the Rare-Class Challenge [7.61140479230184]
We propose and investigate transfer- and active learning solutions to the rare class problem of dissonance detection.
We perform experiments for a specific rare class problem: collecting language samples of cognitive dissonance from social media.
We find that probability-of-rare-class (PRC) is a simple and effective strategy to guide annotations and ultimately improve model accuracy; a generic sketch of this selection strategy appears after the list below.
arXiv Detail & Related papers (2023-05-03T23:29:05Z)
- A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [143.14128737978342]
Test-time adaptation, an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
Recent progress in this paradigm highlights the significant benefits of utilizing unlabeled data for training self-adapted models prior to inference.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
- Mixed-order self-paced curriculum learning for universal lesion detection [36.198165949330566]
Self-paced curriculum learning (SCL) has demonstrated its great potential in computer vision, natural language processing, etc.
It implements easy-to-hard sampling based on online estimation of data difficulty.
Most SCL methods adopt a loss-based strategy of estimating data difficulty and de-weighting the 'hard' samples in the early training stage.
arXiv Detail & Related papers (2023-02-09T14:52:44Z)
- Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z)
- Efficient Test-Time Model Adaptation without Forgetting [60.36499845014649]
Test-time adaptation seeks to tackle potential distribution shifts between training and testing data.
We propose an active sample selection criterion to identify reliable and non-redundant samples.
We also introduce a Fisher regularizer to constrain important model parameters from drastic changes.
arXiv Detail & Related papers (2022-04-06T06:39:40Z)
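The related entry "Reliable and Efficient Amortized Model-based Evaluation" above trains a model to predict question difficulty from its content. The sketch below is a generic illustration of that idea rather than the cited paper's method: it maps question text to a previously estimated IRT-style difficulty using TF-IDF features and ridge regression. The example items, difficulty values, and hyperparameters are assumptions.

```python
# Illustrative difficulty predictor (not the cited paper's method): learn to map
# question text to an already-calibrated IRT difficulty, so new items can be
# scored without collecting fresh response data. All data below are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical calibration set: cloze questions with difficulties (e.g., 2PL b values).
questions = [
    "She ____ to the store every morning. (goes / go / gone / going)",
    "The committee ____ its final report last week. (submitted / emitted / admitted / permitted)",
    "His argument was so ____ that nobody could refute it. (cogent / curt / callow / craven)",
]
difficulties = [-1.2, 0.4, 1.5]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    Ridge(alpha=1.0),
)
model.fit(questions, difficulties)

new_item = "They ____ the proposal after a brief discussion. (rejected / dejected / objected / subjected)"
print("predicted difficulty:", float(model.predict([new_item])[0]))
```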
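Likewise, the "Transfer and Active Learning for Dissonance Detection" entry reports that probability-of-rare-class (PRC) ranking is an effective way to guide annotation for rare classes. The snippet below is a generic active-learning sketch of that selection rule, not the cited paper's pipeline; the toy texts, labels, and classifier choice are assumptions.

```python
# Generic PRC-style selection sketch (assumptions throughout, not the paper's code):
# rank unlabeled texts by the classifier's predicted probability of the rare class
# and send the highest-scoring ones to annotators first.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = [
    "i said one thing but did another",
    "great weather today",
    "i value honesty yet i lied to her",
    "lunch was fine",
]
labels = [1, 0, 1, 0]  # 1 = rare class (e.g., dissonance), 0 = everything else

unlabeled_texts = [
    "no regrets about anything",
    "i want to quit but i keep signing up for more",
    "the bus was late again",
    "i hate meetings yet i keep scheduling them",
]

vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_texts)
clf = LogisticRegression().fit(X_labeled, labels)

p_rare = clf.predict_proba(vectorizer.transform(unlabeled_texts))[:, 1]
for i in np.argsort(-p_rare)[:2]:  # top-2 annotation budget
    print(f"annotate next (p_rare={p_rare[i]:.2f}): {unlabeled_texts[i]}")
```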
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.