QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
- URL: http://arxiv.org/abs/2503.05888v1
- Date: Fri, 07 Mar 2025 19:21:59 GMT
- Title: QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
- Authors: Bang Nguyen, Tingting Du, Mengxia Yu, Lawrence Angrave, Meng Jiang
- Abstract summary: We introduce test item analysis, a method frequently used to assess test question quality, into QG evaluation. We construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Model for Student Modeling and Simulation.
- Score: 13.202947148434333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the Question Generation (QG) task has been increasingly adopted in educational assessments, its evaluation remains limited by approaches that lack a clear connection to the educational values of test items. In this work, we introduce test item analysis, a method frequently used by educators to assess test question quality, into QG evaluation. Specifically, we construct pairs of candidate questions that differ in quality across dimensions such as topic coverage, item difficulty, item discrimination, and distractor efficiency. We then examine whether existing QG evaluation approaches can effectively distinguish these differences. Our findings reveal significant shortcomings in these approaches with respect to accurately assessing test item quality in relation to student performance. To address this gap, we propose a novel QG evaluation framework, QG-SMS, which leverages Large Language Model for Student Modeling and Simulation to perform test item analysis. As demonstrated in our extensive experiments and human evaluation study, the additional perspectives introduced by the simulated student profiles lead to a more effective and robust assessment of test items.
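The item-analysis dimensions named in the abstract correspond to standard classical-test-theory statistics once (simulated) student responses are available. The sketch below is only an illustration of those statistics, not the QG-SMS implementation; the function names, the 5% functionality threshold, and the toy response data are hypothetical, and in QG-SMS the responses would come from LLM-simulated student profiles rather than a hard-coded array.

```python
# Illustrative classical test item analysis over simulated responses (toy data).
import numpy as np

def item_difficulty(correct):
    """Difficulty index p: proportion of (simulated) students answering correctly."""
    return float(correct.mean())

def item_discrimination(correct, total_scores):
    """Point-biserial correlation between the item score (0/1) and the total test score."""
    return float(np.corrcoef(correct, total_scores)[0, 1])

def distractor_efficiency(choices, distractors, threshold=0.05):
    """Fraction of distractors chosen by at least `threshold` of respondents
    (non-functional distractors lower this value)."""
    functional = sum((choices == d).mean() >= threshold for d in distractors)
    return functional / len(distractors)

# Toy example: 6 simulated students, one MCQ item with key "A" and distractors B-D.
choices = np.array(["A", "A", "B", "A", "C", "A"])
correct = (choices == "A").astype(float)
total_scores = np.array([9, 8, 4, 7, 3, 10])  # hypothetical totals on the full test

print(item_difficulty(correct))                      # ~0.67: moderately easy item
print(item_discrimination(correct, total_scores))    # large positive: discriminates well
print(distractor_efficiency(choices, ["B", "C", "D"]))  # "D" never chosen -> 2/3
```

Under this kind of analysis, a candidate question is judged by whether its difficulty falls in a useful range, its discrimination is positive and sizable, and most of its distractors actually attract lower-performing respondents.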
Related papers
- Analyzing Examinee Comments using DistilBERT and Machine Learning to Ensure Quality Control in Exam Content [0.0]
This study explores using Natural Language Processing (NLP) to analyze candidate comments and identify problematic test items.
We developed and validated machine learning models that automatically identify relevant negative feedback.
arXiv Detail & Related papers (2025-04-08T22:08:37Z) - Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom [0.0]
Using data from Cloze tests administered to students in Brazil, word embedding (WE) models for Brazilian Portuguese (PT-BR) were employed to measure semantic similarity.
A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model.
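As a rough sketch of this kind of scoring (not the study's setup, which uses Brazilian Portuguese embeddings and human judges as the reference), the snippet below scores a gap filler by its cosine similarity to the expected word using a small pretrained English GloVe model via gensim; the model name and example words are illustrative assumptions.

```python
# Minimal sketch: score a Cloze gap filler by embedding similarity to the expected word.
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-100")  # small English GloVe, for illustration only

def gap_score(expected, answer):
    """Cosine similarity between the expected gap word and the student's answer."""
    return model.similarity(expected, answer)

print(gap_score("house", "home"))    # high similarity -> likely acceptable filler
print(gap_score("house", "banana"))  # low similarity -> likely off-topic filler
```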
arXiv Detail & Related papers (2024-11-02T15:22:26Z) - Application of Large Language Models in Automated Question Generation: A Case Study on ChatGLM's Structured Questions for National Teacher Certification Exams [2.7363336723930756]
This study explores the application potential of the large language model (LLM) ChatGLM in the automatic generation of structured questions for National Teacher Certification Exams (NTCE).
We guided ChatGLM to generate a series of simulated questions and conducted a comprehensive comparison with questions recollected from past examinees.
The results indicate that the questions generated by ChatGLM exhibit a level of soundness, scientific rigor, and practicality comparable to that of the real exam questions.
arXiv Detail & Related papers (2024-08-19T13:32:14Z) - Assessing test artifact quality -- A tertiary study [1.7827643249624088]
We have carried out a systematic literature review to identify and analyze existing secondary studies on quality aspects of software testing artifacts.
We present an aggregation of the context dimensions and factors that can be used to characterize the environment in which the test case/suite quality is investigated.
arXiv Detail & Related papers (2024-02-14T19:31:57Z) - Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
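One piece of this setup that a small example can make concrete is turning a distribution over discrete, text-defined rating levels back into a scalar score. The snippet below shows one generic way to do that, as an expectation over level values; the level names, values, and probabilities are placeholders and this is not the Q-Align code.

```python
# Hedged illustration: map probabilities over rating-level tokens to a scalar score.
import numpy as np

levels = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}

def levels_to_score(level_probs):
    """Expected score under a probability distribution over rating levels."""
    names = list(levels)
    p = np.array([level_probs[n] for n in names])
    v = np.array([levels[n] for n in names])
    return float(p @ v / p.sum())

# e.g. softmaxed probabilities of the level tokens from the model's output
print(levels_to_score({"bad": 0.02, "poor": 0.08, "fair": 0.30, "good": 0.45, "excellent": 0.15}))  # 3.63
```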
arXiv Detail & Related papers (2023-12-28T16:10:25Z) - Test-Case Quality -- Understanding Practitioners' Perspectives [1.7827643249624088]
We present a quality model which consists of 11 test-case quality attributes.
We identify a misalignment in defining test-case quality among practitioners and between academia and industry.
arXiv Detail & Related papers (2023-09-28T19:10:01Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
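To make the idea of estimating item characteristics and adapting in real time concrete, here is a minimal item-response-theory sketch: compute each item's Fisher information at the current ability estimate under a 2PL model and administer the most informative one. The item parameters, ability estimate, and selection rule below are generic illustrations, not the procedure proposed in that paper.

```python
# Generic adaptive-testing sketch with a 2PL IRT model (made-up parameters).
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)

# Hypothetical item bank: one (discrimination a, difficulty b) pair per item.
bank = np.array([(0.8, -1.0), (1.5, 0.0), (1.2, 1.0), (2.0, 0.5)])
theta_hat = 0.3  # current ability estimate of the examinee (or model under test)

info = np.array([item_information(theta_hat, a, b) for a, b in bank])
next_item = int(np.argmax(info))
print(f"administer item {next_item} (information {info[next_item]:.3f})")
```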
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
arXiv Detail & Related papers (2023-06-07T06:29:58Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z) - Uncertainty-Driven Action Quality Assessment [11.958132175629368]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores. We generate an uncertainty estimate for each prediction, which is employed to re-weight the AQA regression loss. Our proposed method achieves competitive results on three benchmarks, including the Olympic-event datasets MTL-AQA and FineDiving and the surgical-skill dataset JIGSAWS.
arXiv Detail & Related papers (2022-07-29T07:21:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.