Related papers: gencat: Generative computerized adaptive testing

gencat: Generative computerized adaptive testing

URL: http://arxiv.org/abs/2602.20020v1
Date: Mon, 23 Feb 2026 16:28:46 GMT
Title: gencat: Generative computerized adaptive testing
Authors: Wanyong Feng, Andrew Lan,
Abstract summary: We propose GENCAT, a novel CAT framework that leverages Large Language Models for knowledge estimate and question selection.<n>First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and predict responses to unseen questions.<n>Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses.
Score: 1.0162911785128765
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing computerized Adaptive Testing (CAT) frameworks are typically built on predicting the correctness of a student response to a question. Although effective, this approach fails to leverage textual information in questions and responses, especially for open-ended questions. In this work, we propose GENCAT (\textbf{GEN}erative \textbf{CAT}), a novel CAT framework that leverages Large Language Models for knowledge estimate and question selection. First, we develop a Generative Item Response Theory (GIRT) model that enables us to estimate student knowledge from their open-ended responses and predict responses to unseen questions. We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment. Second, we introduce three question selection algorithms that leverage the generative capabilities of the GIRT model, based on the uncertainty, linguistic diversity, and information of sampled student responses. Third, we conduct experiments on two real-world programming datasets and demonstrate that GENCAT outperforms existing CAT baselines, achieving an AUC improvement of up to 4.32\% in the key early testing stages.

Related papers

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.<n>GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution.<n>We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
Pointwise Mutual Information as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that the pointwise mutual information between a context and a question is an effective gauge for language model performance.<n>We propose two methods that use the pointwise mutual information between a document and a question as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG) Evaluation is performed by scoring the RAG on an automatically-generated synthetic exam composed of multiple choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
Addressing Selection Bias in Computerized Adaptive Testing: A User-Wise Aggregate Influence Function Approach [14.175555669521987]
We propose a user-wise aggregate influence function method to tackle the selection bias issue. Our intuition is to filter out users whose response data is heavily biased in an aggregate manner.
arXiv Detail & Related papers (2023-08-23T04:57:21Z)
Diverse and Faithful Knowledge-Grounded Dialogue Generation via Sequential Posterior Inference [82.28542500317445]
We present an end-to-end learning framework, termed Sequential Posterior Inference (SPI), capable of selecting knowledge and generating dialogues. Unlike other methods, SPI does not require the inference network or assume a simple geometry of the posterior distribution.
arXiv Detail & Related papers (2023-06-01T21:23:13Z)
QUADRo: Dataset and Models for QUestion-Answer Database Retrieval [97.84448420852854]
Given a database (DB) of question/answer (q/a) pairs, it is possible to answer a target question by scanning the DB for similar questions. We build a large scale DB of 6.3M q/a pairs, using public questions, and design a new system based on neural IR and a q/a pair reranker. We show that our DB-based approach is competitive with Web-based methods, i.e., a QA system built on top the BING search engine.
arXiv Detail & Related papers (2023-03-30T00:42:07Z)
Automatic Short Math Answer Grading via In-context Meta-learning [2.0263791972068628]
We study the problem of automatic short answer grading for students' responses to math questions. We use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model. Second, we use an in-context learning approach that provides scoring examples as input to the language model.
arXiv Detail & Related papers (2022-05-30T16:26:02Z)
BOBCAT: Bilevel Optimization-Based Computerized Adaptive Testing [3.756550107432323]
Computerized adaptive testing (CAT) refers to a form of tests that are personalized to every student/test taker. We propose BOBCAT, a Bilevel Optimization-Based framework for CAT to directly learn a data-driven question selection algorithm from training data.
arXiv Detail & Related papers (2021-08-17T00:40:23Z)
Quality meets Diversity: A Model-Agnostic Framework for Computerized Adaptive Testing [60.38182654847399]
Computerized Adaptive Testing (CAT) is emerging as a promising testing application in many scenarios. We propose a novel framework, namely Model-Agnostic Adaptive Testing (MAAT) for CAT solution.
arXiv Detail & Related papers (2021-01-15T06:48:50Z)
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks [133.93803565077337]
retrieval-augmented generation models combine pre-trained parametric and non-parametric memory for language generation. We show that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
arXiv Detail & Related papers (2020-05-22T21:34:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.