Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns
- URL: http://arxiv.org/abs/2502.15140v1
- Date: Fri, 21 Feb 2025 01:43:32 GMT
- Title: Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns
- Authors: Naiming Liu, Shashank Sonkar, Richard G. Baraniuk
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in various educational tasks. Their alignment with human learning patterns, particularly in predicting which incorrect options students are most likely to select in multiple-choice questions (MCQs), remains underexplored.
- Score: 25.90420385230675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various educational tasks, yet their alignment with human learning patterns, particularly in predicting which incorrect options students are most likely to select in multiple-choice questions (MCQs), remains underexplored. Our work investigates the relationship between LLM generation likelihood and student response distributions in MCQs, with a specific focus on distractor selections. We collect a comprehensive dataset of MCQs with real-world student response distributions to explore two fundamental research questions: (1) RQ1: Do the distractors that students more frequently select correspond to those to which LLMs assign higher generation likelihood? (2) RQ2: When an LLM selects an incorrect choice, does it choose the same distractor that most students pick? Our experiments reveal moderate correlations between LLM-assigned probabilities and student selection patterns for distractors in MCQs. Additionally, when LLMs make mistakes, they are more likely to select the same incorrect answers that commonly mislead students, a pattern consistent across both small and large language models. Our work provides empirical evidence that, despite LLMs' strong performance on generating educational content, a gap remains between LLMs' underlying reasoning processes and human cognitive processes in identifying confusing distractors. Our findings also have significant implications for educational assessment development: smaller language models could be efficiently utilized for automated distractor generation, as they identify confusing answer choices in patterns similar to larger models. This observed alignment between LLMs and student misconception patterns opens new opportunities for generating high-quality distractors that complement traditional human-designed distractors.
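The comparison at the heart of RQ1 can be made concrete: score each distractor's generation likelihood under a causal LM, then correlate those scores with the observed student selection frequencies. Below is a minimal sketch of that measurement, assuming a HuggingFace causal LM; the model name, prompt template, and student frequencies are illustrative placeholders, not the paper's actual setup or data.

```python
# Hedged sketch: correlate LLM option likelihoods with student response
# distributions (RQ1). Model, prompt, and data below are placeholders.
import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper spans small and large LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_log_likelihood(question: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` given `question`."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    option_ids = tokenizer(" " + option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predicts next token
    prompt_len = prompt_ids.shape[1]
    return sum(
        log_probs[pos, tok].item()
        for pos, tok in zip(range(prompt_len - 1, input_ids.shape[1] - 1),
                            input_ids[0, prompt_len:])
    )

# Hypothetical MCQ: distractors only (the correct answer is set aside when
# comparing error patterns), with made-up student selection frequencies.
question = "Q: What is 1/2 + 1/3? A:"
distractors = ["2/5", "1/5", "1/6"]
student_freqs = [0.45, 0.30, 0.25]

llm_scores = [option_log_likelihood(question, d) for d in distractors]
rho, _ = spearmanr(llm_scores, student_freqs)
print(f"Spearman correlation between LLM likelihood and student picks: {rho:.2f}")
```

For RQ2, the analogous check is simply whether the distractor with the highest LLM likelihood matches the distractor students pick most often, restricted to items the model answers incorrectly.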
Related papers
- FoundationalASSIST: An Educational Dataset for Foundational Knowledge Tracing and Pedagogical Grounding of LLMs [0.8399688944263842]
FoundationalASSIST is the first English educational dataset providing the complete information needed for research on Large Language Models. These 1.7 million interactions from 5,000 students enable research directions that were previously impossible to pursue.
arXiv Detail & Related papers (2026-01-20T17:47:30Z)
- What Makes the Preferred Thinking Direction for LLMs in Multiple-choice Questions? [51.61404787000037]
Language models usually use left-to-right (L2R) autoregressive factorization. We investigate whether alternative factorizations of the text distribution could be beneficial in some tasks.
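For reference, a minimal statement of the standard factorization and one right-to-left alternative of the kind this line of work considers (notation mine, not the paper's):

```latex
% Left-to-right (L2R): each token conditions on its predecessors.
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})
% Right-to-left (R2L): each token conditions on its successors.
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{>t})
```

Both factorizations define the same joint distribution in principle; learned models approximate them differently, which is why the decoding direction can matter for MCQ answering.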
arXiv Detail & Related papers (2025-02-25T18:30:25Z)
- Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs [0.9217021281095907]
We study how Large Language Models (LLMs) answer multiple-choice questions (MCQs) with respect to hardware constraints and refinement techniques.
We explore this space by using generic pre-trained LLMs to answer 162 undergraduate-level MCQs from a Programming Languages (PL) course.
arXiv Detail & Related papers (2025-01-10T11:44:35Z)
- Embracing AI in Education: Understanding the Surge in Large Language Model Use by Secondary Students [53.20318273452059]
Large language models (LLMs) like OpenAI's ChatGPT have opened up new avenues in education. Despite school restrictions, our survey of over 300 middle and high school students revealed that a remarkable 70% of students have utilized LLMs. We propose a few ideas to address such issues, including subject-specific models, personalized learning, and AI classrooms.
arXiv Detail & Related papers (2024-11-27T19:19:34Z)
- Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems chained together, so that solving the second problem depends on correctly answering the first.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena [23.264049073539663]
Multiple-choice questions (MCQs) are frequently used to assess large language models (LLMs).
LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to a priori unbalanced probabilities.
This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions.
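One simple way to surface such ID bias, sketched below under stated assumptions (the `ask_model` callable and prompt format are hypothetical, not this paper's protocol): permute the option contents across the A/B/C/D slots and count which ID the model picks. Across all orderings each option occupies each slot equally often, so a content-driven model spreads its picks evenly over IDs, while a skew toward one letter reveals positional bias.

```python
# Hedged sketch of an answer-ID bias probe; `ask_model` is a stand-in for
# any prompt -> "A"/"B"/"C"/"D" callable, not a specific paper's API.
from collections import Counter
from itertools import permutations

def build_prompt(question: str, options: list[str]) -> str:
    labels = "ABCD"
    lines = [question] + [f"{label}. {opt}" for label, opt in zip(labels, options)]
    return "\n".join(lines) + "\nAnswer:"

def measure_id_bias(ask_model, question: str, options: list[str]) -> Counter:
    """Count the answer ID picked over every ordering of the same options."""
    picks = Counter()
    for perm in permutations(options):
        picks[ask_model(build_prompt(question, list(perm)))] += 1
    return picks

# Dummy model that always answers "A" (maximal ID bias): all 24 orderings
# land on the same letter instead of following the option content.
print(measure_id_bias(lambda prompt: "A", "What is 2+2?", ["4", "5", "3", "22"]))
# Counter({'A': 24})
```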
arXiv Detail & Related papers (2024-06-11T17:59:47Z)
- Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models [40.50115385623107]
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and provide a reliable format for assessments and practice.
One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students.
To date, the task of crafting high-quality distractors largely remains a labor- and time-intensive process for teachers and learning content designers, which limits scalability.
arXiv Detail & Related papers (2024-04-02T17:31:58Z)
- Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation [23.736611338497244]
TinyLLM is a new knowledge distillation paradigm to learn a small student LLM from multiple large teacher LLMs.
We introduce an in-context example generator and a teacher-forcing Chain-of-Thought strategy to ensure that the rationales are accurate and grounded in contextually appropriate scenarios.
Results show that TinyLLM can outperform large teacher LLMs significantly, despite a considerably smaller model size.
arXiv Detail & Related papers (2024-02-07T06:48:24Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments [37.03210795084276]
We examine whether the current format of prompting Large Language Models elicits responses in a consistent and robust manner.
Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model's question-answering ability.
Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions.
arXiv Detail & Related papers (2023-11-16T09:50:53Z)
- Do LLMs exhibit human-like response biases? A case study in survey design [66.1850490474361]
We investigate the extent to which large language models (LLMs) reflect human response biases, if at all.
We design a dataset and framework to evaluate whether LLMs exhibit human-like response biases in survey questionnaires.
Our comprehensive evaluation of nine models shows that popular open and commercial LLMs generally fail to reflect human-like behavior.
arXiv Detail & Related papers (2023-11-07T15:40:43Z)
- Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability to smaller LMs.
We exploit the potential of the LLM as a reasoning teacher by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal how comprehensively language models understand the questions they are asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
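To make "word-level perturbations commonplace in daily language usage" concrete, here is a hedged sketch of one such probe; the adjacent-character-swap rule and the `ask_model` callable are illustrative assumptions, not the paper's evaluation pipeline:

```python
# Hedged sketch: measure how often a model's answer flips under a simple,
# everyday typo (swapping two adjacent characters in random words).
import random

def perturb_word(word: str, rng: random.Random) -> str:
    """Swap two adjacent characters, a common everyday typo."""
    if len(word) < 3:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def perturb_sentence(sentence: str, rate: float = 0.2, seed: int = 0) -> str:
    rng = random.Random(seed)
    return " ".join(perturb_word(w, rng) if rng.random() < rate else w
                    for w in sentence.split())

def flip_rate(ask_model, questions: list[str]) -> float:
    """Fraction of questions whose answer changes under perturbation;
    `ask_model` stands in for any prompt -> answer-string callable."""
    changed = sum(ask_model(q) != ask_model(perturb_sentence(q))
                  for q in questions)
    return changed / len(questions)
```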
arXiv Detail & Related papers (2023-09-20T09:23:46Z)