Bayesian information theoretic model-averaging stochastic item selection for computer adaptive testing: compromise-free item exposure
- URL: http://arxiv.org/abs/2504.15543v1
- Date: Tue, 22 Apr 2025 02:45:16 GMT
- Title: Bayesian information theoretic model-averaging stochastic item selection for computer adaptive testing: compromise-free item exposure
- Authors: Joshua C. Chang, Edison Choe,
- Abstract summary: We formulate the optimization problem for Computer Adaptive Testing (CAT) in terms of Bayesian information theory. We find that our selector has superior properties in terms of both item exposure and test accuracy/efficiency.
- Score: 0.9208007322096533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of Computer Adaptive Testing (CAT) is to reliably estimate an individual's ability as modeled by an item response theory (IRT) instrument using only a subset of the instrument's items. A secondary goal is to vary the items presented across different testing sessions so that the sequence of items does not become overly stereotypical -- we want all items to have an exposure rate sufficiently far from zero. We formulate the optimization problem for CAT in terms of Bayesian information theory, where one chooses the item at each step based on the criterion of the ability model discrepancy -- the statistical distance between the ability estimate at the next step and the full-test ability estimate. This viewpoint of CAT naturally motivates a stochastic selection procedure that equates choosing the next item to sampling from a model-averaging ensemble ability model. Using the NIH Work Disability Functional Assessment Battery (WD-FAB), we evaluate our new methods in comparison to pre-existing methods found in the literature. We find that our stochastic selector has superior properties in terms of both item exposure and test accuracy/efficiency.
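The abstract describes the selection rule only in words: score each candidate item by an information-theoretic discrepancy between ability models, then choose the next item stochastically rather than by an argmax. The sketch below is a hedged illustration of that idea, not the authors' implementation. It assumes a 2PL IRT item bank, a grid posterior over ability, and the expected KL divergence between successive ability posteriors as a stand-in for the paper's ability model discrepancy; the item bank, the proportional-sampling rule, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2PL item bank: discrimination a_j and difficulty b_j per item.
a = rng.uniform(0.8, 2.5, size=50)
b = rng.normal(0.0, 1.0, size=50)
theta_grid = np.linspace(-4.0, 4.0, 161)  # grid over latent ability theta

def p_correct(theta, j):
    """2PL probability of a correct response to item j at ability theta."""
    return 1.0 / (1.0 + np.exp(-a[j] * (theta - b[j])))

def posterior(responses):
    """Grid posterior over theta given (item, 0/1 score) pairs,
    with a standard-normal prior on ability."""
    log_post = -0.5 * theta_grid**2
    for j, y in responses:
        p = p_correct(theta_grid, j)
        log_post += y * np.log(p) + (1 - y) * np.log1p(-p)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def expected_kl(responses, post, j):
    """Expected KL divergence between the ability posterior after
    administering item j and the current posterior, averaged over the
    posterior-predictive response -- a stand-in for the paper's
    ability model discrepancy."""
    p_marg = float(np.sum(post * p_correct(theta_grid, j)))
    kl = 0.0
    for y, w in ((1, p_marg), (0, 1.0 - p_marg)):
        post_next = posterior(responses + [(j, y)])
        kl += w * np.sum(post_next * np.log((post_next + 1e-12) / (post + 1e-12)))
    return kl

def select_item_stochastic(responses, remaining):
    """Sample the next item with probability proportional to its score,
    instead of an argmax, so that exposure is spread across the bank."""
    post = posterior(responses)
    scores = np.array([expected_kl(responses, post, j) for j in remaining])
    probs = scores / scores.sum()
    return int(rng.choice(remaining, p=probs))

# Simulate a short adaptive session for a test-taker with true theta = 0.7.
responses, remaining = [], list(range(len(a)))
for _ in range(8):
    j = select_item_stochastic(responses, remaining)
    y = int(rng.random() < p_correct(0.7, j))
    responses.append((j, y))
    remaining.remove(j)
print("administered items:", [j for j, _ in responses])
```

A deterministic maximum-information selector would take the argmax of the scores; sampling proportionally to them is the kind of stochastic choice that keeps every item's exposure rate away from zero, in the spirit of the abstract.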
Related papers
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop a suite of selectors to get the most informative datapoints for human evaluation. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
arXiv Detail & Related papers (2025-01-30T10:33:26Z) - Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes [54.18828236350544]
Propensity score matching (PSM) addresses selection biases by selecting comparable populations for analysis.
Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria.
To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches.
arXiv Detail & Related papers (2024-07-20T12:42:24Z) - Evaluation of human-model prediction difference on the Internet Scale of Data [32.7296837724399]
Evaluating models on datasets often fails to capture their behavior when faced with unexpected and diverse types of inputs.
We propose OmniInput, a novel approach to evaluate and compare NNs by the PR of an input space.
arXiv Detail & Related papers (2023-12-06T04:53:12Z) - Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z) - Addressing Selection Bias in Computerized Adaptive Testing: A User-Wise Aggregate Influence Function Approach [14.175555669521987]
We propose a user-wise aggregate influence function method to tackle the selection bias issue.
Our intuition is to filter out users whose response data is heavily biased in an aggregate manner.
arXiv Detail & Related papers (2023-08-23T04:57:21Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Out-of-sample scoring and automatic selection of causal estimators [0.0]
We propose novel scoring approaches for both the CATE case and an important subset of instrumental variable problems.
We implement this in an open-source package that relies on the DoWhy and EconML libraries.
arXiv Detail & Related papers (2022-12-20T08:29:18Z) - Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery [1.6114012813668934]
The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument for assessing work-related mental and physical function.
We develop a Bayesian hierarchical model for self-consistently performing several tasks simultaneously.
We compare the resulting item discriminations with those obtained using the traditional posthoc method.
arXiv Detail & Related papers (2022-10-20T01:55:59Z) - Contextual Active Model Selection [10.925932167673764]
We present an approach to actively select pre-trained models while minimizing labeling costs. The objective is to adaptively select the best model to make a prediction while limiting label requests. We propose CAMS, a contextual active model selection algorithm that relies on two novel components.
arXiv Detail & Related papers (2022-07-13T08:22:22Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
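A minimal sketch of that thresholding idea, assuming the confidence score is the model's maximum softmax probability (function and variable names here are hypothetical, not the paper's code):

```python
import numpy as np

def atc_threshold(source_scores, source_correct):
    """Choose t so that the fraction of source-validation scores above t
    matches the observed source accuracy (the ATC thresholding idea)."""
    acc = np.asarray(source_correct, dtype=float).mean()
    return np.quantile(np.asarray(source_scores, dtype=float), 1.0 - acc)

def atc_predict_accuracy(target_scores, t):
    """Predicted target accuracy: fraction of unlabeled target examples
    whose confidence score exceeds the learned threshold."""
    return float((np.asarray(target_scores, dtype=float) > t).mean())

# Illustrative usage with hypothetical arrays:
#   source_scores  -- max-softmax confidence on labeled source validation data
#   source_correct -- 0/1 indicator of whether each source prediction was correct
#   target_scores  -- max-softmax confidence on unlabeled target data
# t = atc_threshold(source_scores, source_correct)
# estimated_target_accuracy = atc_predict_accuracy(target_scores, t)
```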
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Adaptive Sequential Design for a Single Time-Series [2.578242050187029]
We learn an optimal, unknown choice of the controlled components of a design in order to optimize the expected outcome.
We adapt the randomization mechanism for future time-point experiments based on the data collected on the individual over time.
arXiv Detail & Related papers (2021-01-29T22:51:45Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.