Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident
- URL: http://arxiv.org/abs/2602.01015v1
- Date: Sun, 01 Feb 2026 04:46:38 GMT
- Title: Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident
- Authors: Conrad Borchers, Jill-Jênn Vie, Roger Azevedo
- Abstract summary: Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? We evaluate LLMs as novices using 630 think-aloud utterances from chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success.
- Score: 0.8564319625930894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.
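The comparison the abstract describes, contrasting the verbosity and variability of LLM-generated reasoning with human think-aloud utterances, and checking whether step-level learner success is overestimated, can be sketched roughly as follows. This is an illustrative Python sketch only, not the paper's implementation; the example data, the metric choices, and the `success_bias` helper are hypothetical.

```python
# Sketch of the abstract's comparison: verbosity and variability of
# LLM-generated vs. human think-aloud utterances, plus the signed error
# of step-level success predictions. Data and metrics are illustrative.
from statistics import mean, stdev

def utterance_stats(utterances):
    """Mean word count (verbosity) and its spread (variability)."""
    lengths = [len(u.split()) for u in utterances]
    return {"mean_len": mean(lengths), "len_sd": stdev(lengths)}

def success_bias(predicted, actual):
    """Mean signed error of binary step-level success predictions.
    Positive values indicate overestimated learner performance."""
    return mean(p - a for p, a in zip(predicted, actual))

# Hypothetical utterances: fragmented human talk vs. fluent LLM output.
human = ["um so I multiply by the molar mass I think",
         "wait no",
         "divide moles by volume to get molarity then convert"]
llm = ["First, I will compute the molar mass of the compound carefully.",
       "Next, I will convert grams to moles using the molar mass value.",
       "Finally, I will divide moles by liters to obtain the molarity."]

print(utterance_stats(human))  # shorter on average, more variable
print(utterance_stats(llm))    # longer on average, less variable
print(success_bias([1, 1, 1, 1], [1, 0, 1, 0]))  # positive: overestimation
```

Under these toy inputs, the LLM utterances come out longer and less variable than the human ones, mirroring the over-coherent, verbose pattern the abstract reports.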
Related papers
- Learning to Learn from Language Feedback with Social Meta-Learning [17.85279270632852]
Large language models (LLMs) often struggle to learn from corrective feedback within a conversational context. We draw inspiration from social meta-learning in humans - the process of learning how to learn from others. We train LLMs to solicit and learn from language feedback in simulated pedagogical dialogues.
arXiv Detail & Related papers (2026-02-18T14:22:13Z)
- TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning [26.680686158061192]
Reasoning is a fundamental capability of large language models (LLMs). This paper introduces TextGames, a benchmark specifically crafted to assess LLMs through demanding text-based games. Our findings reveal that although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks.
arXiv Detail & Related papers (2025-02-25T18:26:48Z)
- Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. We rigorously analyze both final answers and solution steps to identify reasoning failures. We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z)
- BloomWise: Enhancing Problem-Solving capabilities of Large Language Models using Bloom's-Taxonomy-Inspired Prompts [59.83547898874152]
BloomWise is a cognitively-inspired prompting technique for large language models (LLMs). It is designed to enhance LLMs' performance on mathematical problem solving while making their solutions more explainable.
arXiv Detail & Related papers (2024-10-05T09:27:52Z)
- Multi-Step Reasoning with Large Language Models, a Survey [8.647697652065718]
This article reviews the field of multi-step reasoning with large language models (LLMs). We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, games, and robotics.
arXiv Detail & Related papers (2024-07-16T08:49:35Z)
- Resilience of Large Language Models for Noisy Instructions [38.25524275497566]
Large language models (LLMs) have emerged as powerful tools for interpreting human commands and generating text across various tasks.
This study investigates the resilience of LLMs against five common types of disruptions, including ASR (Automatic Speech Recognition) errors, OCR (Optical Character Recognition) errors, grammatical mistakes, and distractive content.
Our findings reveal that while some LLMs show a degree of resistance to certain types of noise, their overall performance significantly suffers.
arXiv Detail & Related papers (2024-04-15T12:55:08Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? [140.9751389452011]
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
- Democratizing Reasoning Ability: Tailored Learning from Large Language Model [97.4921006089966]
We propose a tailored learning approach to distill such reasoning ability to smaller LMs.
We exploit the potential of LLM as a reasoning teacher by building an interactive multi-round learning paradigm.
To exploit the reasoning potential of the smaller LM, we propose self-reflection learning to motivate the student to learn from self-made mistakes.
arXiv Detail & Related papers (2023-10-20T07:50:10Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal a language model's proficiency in understanding the questions it is asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Spoken Language Intelligence of Large Language Models for Language Learning [3.1964044595140217]
We focus on evaluating the efficacy of large language models (LLMs) in the realm of education. We introduce a new multiple-choice question dataset to evaluate the effectiveness of LLMs in the aforementioned scenarios. We also investigate the influence of various prompting techniques such as zero- and few-shot methods. We find that models of different sizes have a good understanding of concepts in phonetics, phonology, and second language acquisition, but show limitations in reasoning for real-world problems.
arXiv Detail & Related papers (2023-08-28T12:47:41Z)
- Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.