Exploring AI-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment
- URL: http://arxiv.org/abs/2508.17108v1
- Date: Sat, 23 Aug 2025 18:41:30 GMT
- Title: Exploring AI-Enabled Test Practice, Affect, and Test Outcomes in Language Assessment
- Authors: Jill Burstein, Ramsey Cardwell, Ping-Ling Chuang, Allison Michalowski, Steven Nydick,
- Abstract summary: Generative AI-driven, automated item generation (AIG) scales the creation of large item banks and multiple practice tests. This is the first large-scale study exploring the use of AIG-enabled practice tests in high-stakes language assessment.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Practice tests for high-stakes assessments are intended to build test familiarity and reduce construct-irrelevant variance that can interfere with valid score interpretation. Generative AI-driven, automated item generation (AIG) scales the creation of large item banks and multiple practice tests, enabling repeated practice opportunities. We conducted a large-scale observational study (N = 25,969) using the Duolingo English Test (DET) -- a digital, high-stakes, computer-adaptive English language proficiency test -- to examine how increased access to repeated test practice relates to official DET scores, test-taker affect (e.g., confidence), and score-sharing for university admissions. To our knowledge, this is the first large-scale study exploring the use of AIG-enabled practice tests in high-stakes language assessment. Results showed that taking 1-3 practice tests was associated with better performance (scores), positive affect (e.g., confidence) toward the official DET, and an increased likelihood of sharing scores for university admissions among those who also expressed positive affect. Taking more than 3 practice tests was related to lower performance, potentially reflecting washback -- i.e., using the practice test for purposes other than test familiarity, such as language learning or developing test-taking strategies. Findings can inform best practices regarding AI-supported test readiness. They also raise new questions about test-taker preparation behaviors and their relationships to test-taker performance, affect, and behavioral outcomes.
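To make the reported comparison concrete, the sketch below groups test-takers by practice-test count (none, 1-3, more than 3) and compares mean official scores. It is a minimal illustration with made-up data and hypothetical column names, not the authors' actual analysis pipeline.

```python
import pandas as pd

# Hypothetical per-test-taker records; column names are illustrative,
# not the study's actual data schema.
df = pd.DataFrame({
    "practice_tests": [0, 1, 2, 3, 4, 5, 2, 7, 1, 0],
    "official_score": [95, 110, 115, 120, 105, 100, 118, 98, 112, 90],
})

# Bin test-takers the way the abstract describes: none, 1-3, and >3 practice tests.
bins = pd.cut(df["practice_tests"], bins=[-1, 0, 3, float("inf")],
              labels=["0", "1-3", ">3"])

# Compare mean official scores (and group sizes) across practice-use bins.
summary = df.groupby(bins, observed=True)["official_score"].agg(["mean", "count"])
print(summary)
```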
Related papers
- Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality [7.68863194266262]
Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). In this study, we evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity.
arXiv Detail & Related papers (2025-10-13T10:43:49Z)
- Scaling Test-time Compute for LLM Agents [51.790752085445384]
Scaling test-time compute has shown remarkable success in improving the reasoning abilities of large language models (LLMs). In this work, we conduct the first systematic exploration of applying test-time scaling methods to language agents.
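A common way to scale test-time compute is best-of-N sampling: draw several candidate outputs and keep the one a scorer ranks highest. The sketch below illustrates the general pattern with placeholder generate and score functions; it is not the paper's specific agent setup.

```python
import random

def generate(prompt: str) -> str:
    # Placeholder for an LLM call; returns one sampled candidate answer.
    return f"candidate-{random.randint(0, 9)}"

def score(prompt: str, answer: str) -> float:
    # Placeholder for a verifier or reward-model score.
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Spend more test-time compute by sampling n candidates
    # and keeping the one the scorer ranks highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("What is 2 + 2?", n=4))
```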
arXiv Detail & Related papers (2025-06-15T17:59:47Z)
- TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement. TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z)
- Existing Large Language Model Unlearning Evaluations Are Inconclusive [105.55899615056573]
We show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance. We demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines. We propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness.
arXiv Detail & Related papers (2025-05-31T19:43:00Z)
- Gamifying Testing in IntelliJ: A Replicability Study [8.689182960457137]
Gamification is an emerging technique for enhancing motivation and performance in traditionally unengaging tasks like software testing. Previous studies have indicated that gamified systems can improve software testing processes by providing testers with achievements and feedback. This paper aims to replicate and validate the effects of IntelliGame, a gamification plugin for IntelliJ IDEA that engages developers in writing and executing tests.
arXiv Detail & Related papers (2025-04-27T16:17:11Z)
- Ever-Improving Test Suite by Leveraging Large Language Models [0.0]
Augmenting test suites with test cases that reflect the actual usage of the software system is extremely important for sustaining the quality of long-lasting software systems. E-Test is an approach that incrementally augments a test suite with test cases exercising behaviors that emerge in production and have not yet been tested.
arXiv Detail & Related papers (2025-04-15T13:38:25Z)
- NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, word embedding (WE) models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity.
A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
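Embedding-based gap scoring typically reduces to cosine similarity between the vector of the expected word and the vector of the student's answer. The sketch below uses a toy lookup table standing in for pre-trained GloVe vectors; the PT-BR words and values are illustrative only.

```python
import numpy as np

# Toy embedding lookup standing in for pre-trained GloVe vectors for PT-BR;
# real vectors would be loaded from a file of pre-trained embeddings.
embeddings = {
    "casa":  np.array([0.8, 0.1, 0.3]),
    "lar":   np.array([0.7, 0.2, 0.4]),
    "carro": np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine similarity: dot product of the vectors over the product of their norms.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Compare a student's gap filler against the expected answer.
expected, student = "casa", "lar"
print(cosine_similarity(embeddings[expected], embeddings[student]))
```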
arXiv Detail & Related papers (2024-11-02T15:22:26Z)
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
- Implicit assessment of language learning during practice as accurate as explicit testing [0.5749787074942512]
We use Item Response Theory (IRT) in computer-aided language learning for assessment of student ability in two contexts.
We first aim to replace exhaustive tests with efficient but accurate adaptive tests.
Second, we explore whether we can accurately estimate learner ability directly from the context of practice with exercises, without testing.
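IRT models the probability of a correct response from a learner-ability parameter and per-item parameters. A common variant is the two-parameter logistic (2PL) model, sketched below as a generic illustration rather than the paper's exact formulation.

```python
import math

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    # Two-parameter logistic IRT model:
    # P(correct) = 1 / (1 + exp(-a * (theta - b))),
    # where a is item discrimination and b is item difficulty.
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A learner slightly above average ability (theta = 0.5) facing an item
# of average difficulty (b = 0.0) with discrimination a = 1.2.
print(p_correct_2pl(theta=0.5, a=1.2, b=0.0))  # ~0.65
```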
arXiv Detail & Related papers (2024-09-24T14:40:44Z)
- Responsible AI for Test Equity and Quality: The Duolingo English Test as a Case Study [0.06657612504660106]
The chapter presents a case study using the Duolingo English Test (DET), an AI-powered, high-stakes English language assessment.
It discusses the DET RAI standards, their development and their relationship to domain-agnostic RAI principles.
It provides examples of specific RAI practices, showing how these practices meaningfully address the ethical principles of validity and reliability, fairness, privacy and security, and transparency and accountability.
arXiv Detail & Related papers (2024-08-28T11:39:20Z)
- Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, the science of human assessment that originated in the 20th century, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Pre-trained Embeddings for Entity Resolution: An Experimental Analysis [Experiment, Analysis & Benchmark] [65.11858854040544]
We perform a thorough experimental analysis of 12 popular language models over 17 established benchmark datasets.
First, we assess their vectorization overhead for converting all input entities into dense embedding vectors.
Second, we investigate their blocking performance, performing a detailed scalability analysis and comparing them with the state-of-the-art deep learning-based blocking method.
Third, we evaluate their relative performance for both supervised and unsupervised matching.
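Embedding-based blocking typically encodes each entity description to a dense vector and retrieves nearest neighbors as candidate matches, avoiding an all-pairs comparison. The sketch below uses scikit-learn's NearestNeighbors over toy vectors in place of a real language-model encoder; it is an assumption-laden illustration, not the benchmark's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy dense vectors standing in for language-model embeddings of entity
# descriptions (one row per entity in each table).
table_a = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
table_b = np.array([[0.88, 0.12], [0.19, 0.79], [0.1, 0.9]])

# Blocking: for each entity in table A, retrieve its nearest neighbors
# in table B as candidate matches, instead of comparing all pairs.
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(table_b)
distances, candidates = index.kneighbors(table_a)

for i, cand in enumerate(candidates):
    print(f"entity A{i} -> candidate matches in B: {cand.tolist()}")
```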
arXiv Detail & Related papers (2023-04-24T08:53:54Z)
- Cross-validation Confidence Intervals for Test Error [83.67415139421448]
This work develops central limit theorems for cross-validation and consistent estimators of its variance under weak stability conditions on the learning algorithm.
The results are the first of their kind for the popular choice of leave-one-out cross-validation.
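A central limit theorem for cross-validation justifies a normal-approximation interval around the mean fold error. The sketch below shows that generic construction with toy fold errors; it is not the paper's specific variance estimator.

```python
import math

# Per-fold test errors from k-fold cross-validation (toy values).
fold_errors = [0.21, 0.18, 0.25, 0.20, 0.22]

k = len(fold_errors)
mean_err = sum(fold_errors) / k
# Sample standard deviation of the fold errors.
var = sum((e - mean_err) ** 2 for e in fold_errors) / (k - 1)
se = math.sqrt(var / k)

# Normal-approximation 95% confidence interval for the test error,
# justified asymptotically by a CLT for cross-validation.
z = 1.96
print(f"{mean_err:.3f} +/- {z * se:.3f}")
```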
arXiv Detail & Related papers (2020-07-24T17:40:06Z)