Related papers: Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?

URL: http://arxiv.org/abs/2507.08232v1
Date: Fri, 11 Jul 2025 00:36:57 GMT
Title: Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?
Authors: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
Abstract summary: Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs). We collect a dataset of 489 items from the National Assessment of Educational Progress (NAEP) covering mathematics and reading comprehension in grades 4, 8, and 12. We apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations.
Score: 8.558834738072363
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models' performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
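
To illustrate what placing a test taker on an IRT ability scale involves, here is a minimal sketch. It assumes a two-parameter logistic (2PL) model and a simple grid-search maximum-likelihood estimate; the abstract does not say which IRT variant or estimation procedure the authors use, and the item parameters and response vectors below are invented purely for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer
    for ability theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response vector at ability theta."""
    ll = 0.0
    for (a, b), r in zip(items, responses):
        p = p_correct(theta, a, b)
        ll += math.log(p) if r else math.log(1.0 - p)
    return ll


def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via grid search on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical item parameters (discrimination a, difficulty b);
# these are NOT taken from NAEP or from the paper.
items = [(1.0, -1.5), (1.2, -0.5), (0.8, 0.0), (1.5, 0.7), (1.0, 1.5)]

# Hypothetical response vectors (1 = correct, 0 = incorrect).
strong_model = [1, 1, 1, 1, 0]
weak_model = [1, 0, 0, 0, 0]

theta_strong = estimate_ability(items, strong_model)
theta_weak = estimate_ability(items, weak_model)
print(f"strong: {theta_strong:.2f}, weak: {theta_weak:.2f}")
```

Because the item parameters are fixed in advance, any respondent (a student population or an LLM) scored on the same items lands on the same ability scale, which is what makes the comparison described in the abstract possible.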

Related papers

Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners. In this work, we introduce a new task paradigm: proactive information gathering. We design a scalable framework that generates partially specified, real-world tasks, masking key information. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z)
SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction [41.25292844733891]
We present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability. We show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.
arXiv Detail & Related papers (2025-07-07T15:41:38Z)
Investigating Pedagogical Teacher and Student LLM Agents: Genetic Adaptation Meets Retrieval Augmented Generation Across Learning Style [16.985943868964394]
Effective teaching requires adapting instructional strategies to accommodate the diverse cognitive and behavioral profiles of students. This paper introduces a novel simulation framework that integrates heterogeneous student agents with a self-optimizing teacher agent. Our results highlight the potential of LLM-driven simulations to inform adaptive teaching practices and provide a testbed for training human educators in data-driven environments.
arXiv Detail & Related papers (2025-05-25T14:45:35Z)
Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study [0.0]
Large Language Models (LLMs) hold promise as dynamic instructional aids. Yet, it remains unclear whether LLMs can replicate the adaptivity of intelligent tutoring systems (ITS).
arXiv Detail & Related papers (2025-04-07T23:57:32Z)
Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z)
LLM-based Cognitive Models of Students with Misconceptions [55.29525439159345]
This paper investigates whether Large Language Models (LLMs) can be instruction-tuned to meet this dual requirement. We introduce MalAlgoPy, a novel Python library that generates datasets reflecting authentic student solution patterns. Our insights enhance our understanding of AI-based student models and pave the way for effective adaptive learning systems.
arXiv Detail & Related papers (2024-10-16T06:51:09Z)
Learning to Love Edge Cases in Formative Math Assessment: Using the AMMORE Dataset and Chain-of-Thought Prompting to Improve Grading Accuracy [0.0]
This paper introduces AMMORE, a new dataset of 53,000 math open-response question-answer pairs from Rori. We conduct two experiments to evaluate the use of large language models (LLM) for grading challenging student answers.
arXiv Detail & Related papers (2024-09-26T14:51:40Z)
Evaluating the Impact of Advanced LLM Techniques on AI-Lecture Tutors for a Robotics Course [0.35132421583441026]
This study evaluates the performance of Large Language Models (LLMs) as an Artificial Intelligence-based tutor for a university course. In particular, different advanced techniques are utilized, such as prompt engineering, Retrieval-Augmented-Generation (RAG), and fine-tuning. Our findings indicate that RAG combined with prompt engineering significantly enhances model responses and produces better factual answers.
arXiv Detail & Related papers (2024-08-02T19:49:19Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Toward In-Context Teaching: Adapting Examples to Students' Misconceptions [54.82965010592045]
We introduce a suite of models and evaluation methods we call AdapT. AToM is a new probabilistic model for adaptive teaching that jointly infers students' past beliefs and optimizes for the correctness of future beliefs. Our results highlight both the difficulty of the adaptive teaching task and the potential of learned adaptive models for solving it.
arXiv Detail & Related papers (2024-05-07T17:05:27Z)
Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning [104.58874584354787]
In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models.
arXiv Detail & Related papers (2023-01-27T18:59:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.