BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration
- URL: http://arxiv.org/abs/2410.21033v1
- Date: Mon, 28 Oct 2024 13:54:10 GMT
- Title: BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration
- Authors: James Sharpnack, Kevin Hao, Phoebe Mulcaire, Klinton Bicknell, Geoff LaFlair, Kevin Yancey, Alina A. von Davier
- Abstract summary: We present a complete framework for calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses.
We use AutoIRT, a new method that combines automated machine learning (AutoML) with item response theory (IRT).
For administration, we propose the BanditCAT framework, a methodology motivated by casting item selection as a contextual bandit problem under IRT assumptions.
- Score: 7.261063083251448
- License:
- Abstract: In this paper, we present a complete framework for quickly calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses. Calibration - learning item parameters in a test - is done using AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT), originally proposed in [Sharpnack et al., 2024]. AutoIRT trains a non-parametric AutoML grading model using item features, followed by an item-specific parametric model, which results in an explanatory IRT model. In our work, we use tabular AutoML tools (AutoGluon.tabular, [Erickson et al., 2020]) along with BERT embeddings and linguistically motivated NLP features. In this framework, we use Bayesian updating to obtain test taker ability posterior distributions for administration and scoring. For administration of our adaptive test, we propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT). The key insight lies in defining the bandit reward as the Fisher information for the selected item, given the latent test taker ability from IRT assumptions. We use Thompson sampling to balance between exploring items with different psychometric characteristics and selecting highly discriminative items that give more precise information about ability. To control item exposure, we inject noise through an additional randomization step before computing the Fisher information. This framework was used to initially launch two new item types on the DET practice test using limited training data. We outline some reliability and exposure metrics for the 5 practice test experiments that utilized this framework.
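To make the administration step concrete, here is a minimal sketch of BanditCAT-style item selection, assuming a 2PL IRT model, a grid-based posterior over ability, and Gumbel noise for the exposure-control randomization step. The item bank, noise scale, and function names are illustrative and not taken from the paper.

```python
import numpy as np

# Minimal sketch of BanditCAT-style item selection under a 2PL IRT model.
# Assumptions (not from the paper): grid-based posterior over ability theta,
# Gumbel noise for the exposure-control randomization, illustrative parameters.

rng = np.random.default_rng(0)

theta_grid = np.linspace(-4, 4, 161)          # discretized ability scale
posterior = np.exp(-0.5 * theta_grid**2)      # N(0, 1) prior on theta
posterior /= posterior.sum()

def p_correct(a, b, theta):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(a, b, theta):
    """Item Fisher information at ability theta under the 2PL model."""
    p = p_correct(a, b, theta)
    return a**2 * p * (1.0 - p)

def select_item(item_bank, posterior, exposure_noise=0.5):
    """Thompson sampling: draw theta from the posterior, then pick the item
    maximizing noisy Fisher information (the noise limits item exposure)."""
    theta_sample = rng.choice(theta_grid, p=posterior)
    info = np.array([fisher_information(a, b, theta_sample) for a, b in item_bank])
    noisy = np.log(info + 1e-12) + exposure_noise * rng.gumbel(size=len(item_bank))
    return int(np.argmax(noisy))

def update_posterior(posterior, a, b, response):
    """Bayesian update of the ability posterior after observing a 0/1 response."""
    p = p_correct(a, b, theta_grid)
    likelihood = p if response == 1 else (1.0 - p)
    posterior = posterior * likelihood
    return posterior / posterior.sum()

# Toy item bank of (discrimination a, difficulty b) pairs.
item_bank = [(1.2, -0.5), (0.8, 0.0), (1.5, 0.7), (1.0, 1.5)]
j = select_item(item_bank, posterior)
posterior = update_posterior(posterior, *item_bank[j], response=1)
```

Scoring would then summarize the final posterior, for example by its mean, consistent with the Bayesian updating described above.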
Related papers
- Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales [0.0]
We present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders.
We demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit.
arXiv Detail & Related papers (2024-10-02T12:33:16Z)
- AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning [8.079755354261328]
We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools.
It is based on a Monte Carlo EM (MCEM) outer loop with a two-stage inner loop, which trains a non-parametric AutoML grading model using item features followed by an item-specific parametric model (a sketch of the inner loop follows this entry).
We show that the resulting model is typically better calibrated, achieves better predictive performance, and produces more accurate scores than existing methods.
arXiv Detail & Related papers (2024-09-13T13:36:51Z)
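The calibration side can be sketched in the same spirit. The block below is a hedged illustration of the two-stage inner loop described in this entry, not the authors' implementation: a scikit-learn gradient-boosted classifier stands in for the AutoML grading model (the paper uses AutoGluon.tabular with BERT embeddings and NLP features), theta_draws is assumed to hold imputed abilities from the MCEM E-step (which is omitted here), and the item-specific parametric model is taken to be a 2PL curve fit.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from scipy.optimize import curve_fit

# Hedged sketch of a two-stage AutoIRT-style inner loop. Assumptions: a
# gradient-boosted classifier stands in for the AutoML grading model, item
# features are precomputed numeric vectors, and 2PL curves are fit per item.

def two_pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def inner_loop(item_features, item_ids, theta_draws, responses):
    """Stage 1: non-parametric grading model P(correct | item features, theta).
    Stage 2: item-specific 2PL parameters fit to the stage-1 response curves."""
    X = np.column_stack([item_features[item_ids], theta_draws])
    grader = GradientBoostingClassifier().fit(X, responses)

    theta_grid = np.linspace(-3, 3, 25)
    params = {}
    for j in np.unique(item_ids):
        Xg = np.column_stack([np.tile(item_features[j], (len(theta_grid), 1)), theta_grid])
        curve = grader.predict_proba(Xg)[:, 1]      # predicted P(correct | theta) for item j
        (a, b), _ = curve_fit(two_pl, theta_grid, curve, p0=(1.0, 0.0), maxfev=5000)
        params[j] = (a, b)
    return grader, params
```

In the full MCEM procedure, the outer loop would re-impute theta_draws from the current item parameters and repeat this inner loop until convergence.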
- Test-Time Model Adaptation with Only Forward Passes [68.11784295706995]
Test-time adaptation has proven effective in adapting a given trained model to unseen test samples with potential distribution shifts.
We propose a test-time Forward-Optimization Adaptation (FOA) method.
FOA runs on quantized 8-bit ViT, outperforms gradient-based TENT on full-precision 32-bit ViT, and achieves an up to 24-fold memory reduction on ImageNet-C.
arXiv Detail & Related papers (2024-04-02T05:34:33Z)
- A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [143.14128737978342]
Test-time adaptation, an emerging paradigm, has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
Recent progress in this paradigm highlights the significant benefits of utilizing unlabeled data for training self-adapted models prior to inference.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
- Robust Test-Time Adaptation in Dynamic Scenarios [9.475271284789969]
Test-time adaptation (TTA) intends to adapt the pretrained model to test distributions with only unlabeled test data streams.
We elaborate a Robust Test-Time Adaptation (RoTTA) method against the complex data stream in PTTA.
Our method is easy to implement, making it a good choice for rapid deployment.
arXiv Detail & Related papers (2023-03-24T10:19:14Z)
- Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery [1.6114012813668934]
The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument for assessing work-related mental and physical function.
We develop a Bayesian hierarchical model for self-consistently performing the following simultaneous tasks.
We compare the resulting item discriminations obtained using the traditional posthoc method.
arXiv Detail & Related papers (2022-10-20T01:55:59Z)
- TeST: Test-time Self-Training under Distribution Shift [99.68465267994783]
Test-Time Self-Training (TeST) is a technique that takes as input a model trained on some source data and a novel data distribution at test time.
We find that models adapted using TeST significantly improve over baseline test-time adaptation algorithms.
arXiv Detail & Related papers (2022-09-23T07:47:33Z)
- Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models [107.05966685291067]
We propose test-time prompt tuning (TPT) to learn adaptive prompts on the fly with a single test sample.
TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
arXiv Detail & Related papers (2022-09-15T17:55:11Z)
- TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z)
- Systematic Training and Testing for Machine Learning Using Combinatorial Interaction Testing [0.0]
This paper demonstrates the systematic use of coverage for selecting and characterizing test and training sets for machine learning models.
The paper addresses prior criticism of coverage and provides a rebuttal which advocates the use of coverage metrics in machine learning applications.
arXiv Detail & Related papers (2022-01-28T21:33:31Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model under test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.