Generating and Evaluating Tests for K-12 Students with Language Model
Simulations: A Case Study on Sentence Reading Efficiency
- URL: http://arxiv.org/abs/2310.06837v1
- Date: Tue, 10 Oct 2023 17:59:51 GMT
- Title: Generating and Evaluating Tests for K-12 Students with Language Model
Simulations: A Case Study on Sentence Reading Efficiency
- Authors: Eric Zelikman, Wanjing Anya Ma, Jasmine E. Tran, Diyi Yang, Jason D.
Yeatman, Nick Haber
- Abstract summary: This study focuses on tests of silent sentence reading efficiency, used to assess students' reading ability over time.
We propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items.
We show the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses.
- Score: 45.6224547703717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing an educational test can be expensive and time-consuming, as each
item must be written by experts and then evaluated by collecting hundreds of
student responses. Moreover, many tests require multiple distinct sets of
questions administered throughout the school year to closely monitor students'
progress, known as parallel tests. In this study, we focus on tests of silent
sentence reading efficiency, used to assess students' reading ability over
time. To generate high-quality parallel tests, we propose to fine-tune large
language models (LLMs) to simulate how previous students would have responded
to unseen items. With these simulated responses, we can estimate each item's
difficulty and ambiguity. We first use GPT-4 to generate new test items
following a list of expert-developed rules and then apply a fine-tuned LLM to
filter the items based on criteria from psychological measurements. We also
propose an optimal-transport-inspired technique for generating parallel tests
and show the generated tests closely correspond to the original test's
difficulty and reliability based on crowdworker responses. Our evaluation of a
generated test with 234 students from grades 2 to 8 produces test scores highly
correlated (r=0.93) to those of a standard test form written by human experts
and evaluated across thousands of K-12 students.
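To make the pipeline described above concrete, the minimal sketch below shows one way its pieces could fit together: a matrix of simulated per-student responses (which the paper obtains from a fine-tuned LLM; here replaced with random placeholder data) is reduced to per-item difficulty and ambiguity estimates, weak items are filtered out, and a parallel form is assembled by matching surviving candidate items to a reference form's difficulties. The thresholds, formulas, and the greedy matcher are illustrative assumptions standing in for the authors' fine-tuned simulator, psychometric filtering criteria, and optimal-transport-inspired assembly; this is not the paper's implementation.

```python
# Minimal sketch (not the authors' code): estimate item difficulty/ambiguity from
# simulated student responses, filter items, and assemble a parallel form by
# greedily matching each reference item to the closest surviving candidate item.
# In the paper the response matrix comes from a fine-tuned LLM simulating prior
# students; here it is random placeholder data.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_candidates, n_reference = 500, 120, 40

# Placeholder "simulated" responses: 1 = correct, 0 = incorrect.
sim_responses = rng.binomial(
    1, rng.uniform(0.5, 0.98, n_candidates), size=(n_students, n_candidates)
)

# Item difficulty: negative logit of the proportion of simulated students
# answering correctly (harder items have higher values).
p_correct = sim_responses.mean(axis=0).clip(1e-3, 1 - 1e-3)
difficulty = -np.log(p_correct / (1 - p_correct))

# Item "ambiguity" proxy: disagreement among simulated students (response variance).
ambiguity = sim_responses.var(axis=0)

# Psychometric-style filter (ad-hoc thresholds): drop items that are too hard
# or too ambiguous for the simulated population.
keep = (p_correct > 0.6) & (ambiguity < 0.24)
candidate_ids = np.flatnonzero(keep)

# Difficulties of the reference (human-written) form that the parallel form
# should mirror.
reference_difficulty = np.sort(rng.normal(-1.5, 0.8, n_reference))

# Greedy one-to-one matching on difficulty: a crude stand-in for the
# optimal-transport-inspired assembly described in the abstract.
available = list(candidate_ids)
parallel_form = []
for d_ref in reference_difficulty:
    best = min(available, key=lambda i: abs(difficulty[i] - d_ref))
    parallel_form.append(best)
    available.remove(best)

print("parallel form items:", parallel_form)
print("mean |difficulty gap|:",
      np.mean(np.abs(difficulty[parallel_form] - reference_difficulty)))
```

A full optimal-transport formulation would instead solve a global assignment between the candidate and reference difficulty distributions (for example via a linear sum assignment) rather than matching greedily item by item, and the paper's filtering criteria come from psychometric measurement rather than the ad-hoc thresholds used above.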
Related papers
- Test smells in LLM-Generated Unit Tests [11.517293765116307]
This study explores the diffusion of test smells in unit test suites generated by Large Language Models.
We analyze a benchmark of 20,500 LLM-generated test suites produced by four models across five prompt engineering techniques.
We identify and analyze the prevalence and co-occurrence of various test smells in both human written and LLM-generated test suites.
arXiv Detail & Related papers (2024-10-14T15:35:44Z)
- Do Test and Environmental Complexity Increase Flakiness? An Empirical Study of SAP HANA [47.29324864511411]
Flaky tests fail seemingly at random without changes to the code.
We study characteristics of tests and the test environment that potentially impact test flakiness.
arXiv Detail & Related papers (2024-09-16T07:52:09Z)
- Multi-language Unit Test Generation using LLMs [6.259245181881262]
We describe a generic pipeline that incorporates static analysis to guide LLMs in generating compilable and high-coverage test cases.
We show how the pipeline can be applied to different programming languages, specifically Java and Python, and to complex software requiring environment mocking.
Our results demonstrate that LLM-based test generation, when guided by static analysis, can be competitive with, and even outperform, state-of-the-art test-generation techniques in coverage achieved.
arXiv Detail & Related papers (2024-09-04T21:46:18Z)
- Leveraging Large Language Models for Enhancing the Understandability of Generated Unit Tests [4.574205608859157]
We introduce UTGen, which combines search-based software testing and large language models to enhance the understandability of automatically generated test cases.
We observe that participants working on assignments with UTGen test cases fix up to 33% more bugs and use up to 20% less time when compared to baseline test cases.
arXiv Detail & Related papers (2024-08-21T15:35:34Z)
- A System for Automated Unit Test Generation Using Large Language Models and Assessment of Generated Test Suites [1.4563527353943984]
Large Language Models (LLMs) have been applied to various aspects of software development.
We present AgoneTest: an automated system for generating test suites for Java projects.
arXiv Detail & Related papers (2024-08-14T23:02:16Z)
- Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training [55.321010757641524]
A major public concern regarding the training of large language models (LLMs) is whether they abuse copyrighted online text.
Previous membership inference methods may be misled by similar examples in vast amounts of training data.
We propose an alternative insert-and-detection methodology, advocating that web users and content platforms employ unique identifiers.
arXiv Detail & Related papers (2024-03-23T06:36:32Z)
- Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [55.66090768926881]
We study the correspondence between decontextualized "trick tests" and evaluations that are more grounded in Realistic Use and Tangible Effects.
We compare three de-contextualized evaluations adapted from the current literature to three analogous RUTEd evaluations applied to long-form content generation.
We found no correspondence between trick tests and RUTEd evaluations.
arXiv Detail & Related papers (2024-02-20T01:49:15Z)
- On the use of test smells for prediction of flaky tests [0.0]
Flaky tests hamper the evaluation of test results and can increase costs.
Existing approaches based on the use of the test case vocabulary may be context-sensitive and prone to overfitting.
We investigate the use of test smells as predictors of flaky tests.
arXiv Detail & Related papers (2021-08-26T13:21:55Z)
- Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision [85.07855130048951]
We study a more practical task setting, called test-agnostic long-tailed recognition, where the training class distribution is long-tailed while the test class distribution is unknown.
We propose a new method, called Test-time Aggregating Diverse Experts (TADE), that trains diverse experts to excel at handling different test distributions.
We theoretically show that our method has provable ability to simulate unknown test class distributions.
arXiv Detail & Related papers (2021-07-20T04:10:31Z)
- Empowering Language Understanding with Counterfactual Reasoning [141.48592718583245]
We propose a Counterfactual Reasoning Model, which mimics counterfactual thinking by learning from a few counterfactual samples.
In particular, we devise a generation module to generate representative counterfactual samples for each factual sample, and a retrospective module to retrospect the model prediction by comparing the counterfactual and factual samples.
arXiv Detail & Related papers (2021-06-06T06:36:52Z)