Generating and Evaluating Tests for K-12 Students with Language Model
Simulations: A Case Study on Sentence Reading Efficiency
- URL: http://arxiv.org/abs/2310.06837v1
- Date: Tue, 10 Oct 2023 17:59:51 GMT
- Authors: Eric Zelikman, Wanjing Anya Ma, Jasmine E. Tran, Diyi Yang, Jason D.
Yeatman, Nick Haber
- Abstract summary: This study focuses on tests of silent sentence reading efficiency, used to assess students' reading ability over time.
We propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items.
We show the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses.
- Score: 45.6224547703717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing an educational test can be expensive and time-consuming, as each
item must be written by experts and then evaluated by collecting hundreds of
student responses. Moreover, many tests require multiple distinct sets of
questions administered throughout the school year to closely monitor students'
progress, known as parallel tests. In this study, we focus on tests of silent
sentence reading efficiency, used to assess students' reading ability over
time. To generate high-quality parallel tests, we propose to fine-tune large
language models (LLMs) to simulate how previous students would have responded
to unseen items. With these simulated responses, we can estimate each item's
difficulty and ambiguity. We first use GPT-4 to generate new test items
following a list of expert-developed rules and then apply a fine-tuned LLM to
filter the items based on criteria from psychological measurements. We also
propose an optimal-transport-inspired technique for generating parallel tests
and show the generated tests closely correspond to the original test's
difficulty and reliability based on crowdworker responses. Our evaluation of a
generated test with 234 students from grades 2 to 8 produces test scores highly
correlated (r=0.93) to those of a standard test form written by human experts
and evaluated across thousands of K-12 students.
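The abstract describes two steps that a short sketch may make more concrete: estimating each item's difficulty and ambiguity from simulated responses, and selecting new items so that a generated test matches the original test's difficulty. The Python below is only an illustration under stated assumptions, not the authors' implementation. The response record layout, the use of mean response time as a difficulty proxy, the agreement-based ambiguity proxy, and the function names `item_stats` and `match_parallel_form` are all hypothetical. The matching step uses a one-dimensional minimum-cost matching as a simple stand-in for the optimal-transport-inspired technique, whose details the abstract does not give.

```python
from statistics import mean


def item_stats(responses):
    """Summarize simulated responses for one item.

    `responses` is a list of (answered_correctly, response_time_seconds)
    pairs. This record layout is an assumption made for the sketch.
    """
    accuracy = mean(1.0 if correct else 0.0 for correct, _ in responses)
    # Assumed difficulty proxy: mean simulated response time.
    difficulty = mean(seconds for _, seconds in responses)
    # Assumed ambiguity proxy: 0 when all simulated students agree,
    # 1 when they split 50/50.
    ambiguity = 1.0 - abs(2.0 * accuracy - 1.0)
    return difficulty, ambiguity


def match_parallel_form(reference, candidates):
    """Pick one candidate per reference item, minimizing total |difference|.

    Both inputs are lists of difficulty values. With an absolute-difference
    cost on a line, some optimal matching preserves sorted order, so a small
    dynamic program over the sorted lists is enough.
    Returns (total_cost, chosen_candidate_difficulties).
    """
    ref = sorted(reference)
    cand = sorted(candidates)
    n, m = len(ref), len(cand)
    if m < n:
        raise ValueError("need at least as many candidates as reference items")
    inf = float("inf")
    # cost[i][j]: best cost of matching the first i reference items
    # using only the first j candidates.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            skip = cost[i][j - 1]  # leave candidate j unused
            use = cost[i - 1][j - 1] + abs(ref[i - 1] - cand[j - 1])
            cost[i][j] = min(skip, use)
    # Trace back through the table to recover the chosen candidates.
    chosen = []
    i, j = n, m
    while i > 0:
        if cost[i][j] == cost[i][j - 1]:
            j -= 1
        else:
            chosen.append(cand[j - 1])
            i -= 1
            j -= 1
    chosen.reverse()
    return cost[n][m], chosen


if __name__ == "__main__":
    # Made-up simulated responses for a single item.
    print(item_stats([(True, 2.0), (True, 4.0), (False, 3.0), (True, 3.0)]))
    # Made-up difficulty values for an original form and a candidate pool.
    print(match_parallel_form([2.0, 4.0], [1.0, 2.5, 3.9, 10.0]))
```

In this sketch, items with a high ambiguity value could be filtered out before matching, and the total cost returned by `match_parallel_form` gives a rough sense of how closely the selected items track the reference form.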
Related papers
- Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [55.66090768926881]
We study the correspondence between decontextualized "trick tests" and evaluations that are more grounded in Realistic Use and Tangible Effects.
We compare three decontextualized evaluations adapted from the current literature to three analogous RUTEd evaluations applied to long-form content generation.
We found no correspondence between trick tests and RUTEd evaluations.
arXiv Detail & Related papers (2024-02-20T01:49:15Z)
- Manual Tests Do Smell! Cataloging and Identifying Natural Language Test Smells [1.43994708364763]
Test smells indicate potential problems in the design and implementation of automated software tests.
This study aims to contribute to a catalog of test smells for manual tests.
arXiv Detail & Related papers (2023-08-02T19:05:36Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into the subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- Efficiently Measuring the Cognitive Ability of LLMs: An Adaptive Testing Perspective [63.92197404447808]
Large language models (LLMs) have shown some human-like cognitive abilities.
We propose an adaptive testing framework for LLM evaluation.
This approach dynamically adjusts the characteristics of the test questions, such as difficulty, based on the model's performance.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Validation of massively-parallel adaptive testing using dynamic control matching [0.0]
Modern businesses often run many A/B/n tests at the same time and in parallel, and package many content variations into the same messages.
This paper presents a method for disentangling the causal effects of the various tests under conditions of continuous test adaptation.
arXiv Detail & Related papers (2023-05-02T11:28:12Z)
- Hybrid Intelligent Testing in Simulation-Based Verification [0.0]
Several millions of tests may be required to achieve coverage goals.
Coverage-Directed Test Selection learns from coverage feedback to bias testing towards the most effective tests.
Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli.
arXiv Detail & Related papers (2022-05-19T13:22:08Z)
- On the use of test smells for prediction of flaky tests [0.0]
Flaky tests hamper the evaluation of test results and can increase costs.
Existing approaches based on the use of the test case vocabulary may be context-sensitive and prone to overfitting.
We investigate the use of test smells as predictors of flaky tests.
arXiv Detail & Related papers (2021-08-26T13:21:55Z)
- Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision [85.07855130048951]
We study a more practical task setting, called test-agnostic long-tailed recognition, where the training class distribution is long-tailed.
We propose a new method, called Test-time Aggregating Diverse Experts (TADE), that trains diverse experts to excel at handling different test distributions.
We theoretically show that our method has provable ability to simulate unknown test class distributions.
arXiv Detail & Related papers (2021-07-20T04:10:31Z)
- Empowering Language Understanding with Counterfactual Reasoning [141.48592718583245]
We propose a Counterfactual Reasoning Model, which mimics counterfactual thinking by learning from a few counterfactual samples.
In particular, we devise a generation module to generate representative counterfactual samples for each factual sample, and a retrospective module to retrospect the model prediction by comparing the counterfactual and factual samples.
arXiv Detail & Related papers (2021-06-06T06:36:52Z)
- Learning by Passing Tests, with Application to Neural Architecture Search [19.33620150924791]
We propose a novel learning approach called learning by passing tests.
A tester model creates increasingly more-difficult tests to evaluate a learner model.
The learner tries to continuously improve its learning ability so that it can pass the tests created by the tester, however difficult they are.
arXiv Detail & Related papers (2020-11-30T18:33:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.