LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse
- URL: http://arxiv.org/abs/2602.09832v1
- Date: Tue, 10 Feb 2026 14:38:13 GMT
- Title: LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse
- Authors: Bakhtawar Ahtisham, Kirk Vanacore, Zhuqian Zhou, Jinsook Lee, Rene F. Kizilcec
- Abstract summary: Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning.
- Score: 0.18268488712787334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
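The pipeline the abstract describes is straightforward to prototype. Below is a minimal sketch, assuming a hypothetical `reasoning_traces.csv` with a `reasoning` column (the LLM's explanation for its label) and a binary `label_correct` column (whether that label matched the human ground truth); the features, classifier choice, and hyperparameters are illustrative stand-ins rather than the authors' exact configuration, and the hedging lexicon at the end is a rough proxy for the proprietary LIWC categories.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical input: one row per (utterance, model) pair.
df = pd.read_csv("reasoning_traces.csv")

# Encode each reasoning trace as a TF-IDF vector, as in the paper.
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["reasoning"])
y = df["label_correct"]  # 1 = model's label matched the human ground truth

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The abstract highlights Random Forest among the five classifiers evaluated.
clf = RandomForestClassifier(n_estimators=300, random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# Rough stand-in for the LIWC Tentativeness marker: hedging-word rate.
HEDGES = {"might", "could", "perhaps", "possibly", "seems"}
df["hedge_rate"] = df["reasoning"].str.lower().str.split().apply(
    lambda toks: sum(t.strip(".,") in HEDGES for t in toks) / max(len(toks), 1)
)
```

A construct-specific "specialist" detector, in the spirit of the paper's variant, would simply filter `df` to a single instructional move construct before the split.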
Related papers
- CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
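The three structures named in this summary are the standard three-variable causal motifs; a minimal sketch with placeholder variable names (these edge lists are illustrative, not items from the benchmark):

```python
# Canonical three-variable causal motifs behind CausalFlip's event triples.
# Variable names are placeholders, not benchmark content.
structures = {
    "chain":      [("A", "B"), ("B", "C")],  # A -> B -> C
    "confounder": [("B", "A"), ("B", "C")],  # A <- B -> C (B is a common cause)
    "collider":   [("A", "B"), ("C", "B")],  # A -> B <- C
}

for name, edges in structures.items():
    print(f"{name:10s}", ", ".join(f"{u} -> {v}" for u, v in edges))
```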
arXiv Detail & Related papers (2026-02-23T18:06:15Z)
- Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification [56.51953062869371]
DoVerifier is a symbolic verifier that checks whether causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. Our evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness of causal reasoning traces.
arXiv Detail & Related papers (2026-01-29T03:22:58Z)
- Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks [8.210112631285666]
Large language models (LLMs) have demonstrated strong performance on formal language tasks. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages. We show that models achieve perfect accuracy on factual questions and 84-90% on seen tasks, but accuracy drops sharply on unseen problems.
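As a toy instance of the construction task this benchmark targets (an illustrative example, not one of the benchmark's held-out items), here is a DFA for the regular language of binary strings containing an even number of 1s:

```python
# Toy DFA: accepts binary strings with an even number of 1s.
# Illustrative of the task format only; not a benchmark item.
TRANSITIONS = {  # (state, symbol) -> next state
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
START, ACCEPTING = "even", {"even"}

def accepts(s: str) -> bool:
    state = START
    for ch in s:
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING

assert accepts("1100") and accepts("") and not accepts("10110")
```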
arXiv Detail & Related papers (2026-01-19T21:00:31Z)
- Improving Symbolic Translation of Language Models for Logical Reasoning [14.474630644806723]
Small language models (LMs) often struggle with translating natural language (NL) into first-order logic (FOL). Existing approaches typically rely on self-iteration to correct these errors, but such methods depend heavily on the capabilities of the underlying model. We introduce incremental inference, which divides inference into two stages (predicate generation, then FOL translation), providing greater control over model behavior.
arXiv Detail & Related papers (2026-01-14T12:47:14Z)
- Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank [71.09032766271493]
Large language models (LLMs) are prone to errors and hallucinations. Checking their outputs effectively and efficiently has therefore become a critical problem for their applications.
arXiv Detail & Related papers (2025-10-28T11:01:10Z)
- Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models [6.312798900093575]
Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning. This paper focuses on the fundamental tradeoff between accuracy and overthinking. We introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token-efficiency for holistic model evaluation.
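The summary does not give the exact formula, but a harmonic mean of accuracy and a token-efficiency term would look like the following sketch; the efficiency normalization here is an assumption, not the paper's definition:

```python
def overthinking_score(accuracy: float, tokens_used: int, token_budget: int) -> float:
    """Harmonic mean of accuracy and token efficiency (both in [0, 1]).

    The efficiency normalization (budget / tokens used, capped at 1) is an
    illustrative assumption; the paper's exact definition may differ.
    """
    efficiency = min(token_budget / max(tokens_used, 1), 1.0)
    if accuracy <= 0 or efficiency <= 0:
        return 0.0
    return 2 * accuracy * efficiency / (accuracy + efficiency)

# A verbose but accurate model is penalized relative to a concise one.
print(overthinking_score(0.90, tokens_used=120, token_budget=60))  # ~0.64
print(overthinking_score(0.90, tokens_used=50,  token_budget=60))  # ~0.95
```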
arXiv Detail & Related papers (2025-07-05T12:31:17Z)
- Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z)
- Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents [3.5083201638203154]
Logic-Enhanced Language Model Agents (LELMA) is a framework that integrates large language models with formal logic. LELMA employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement.
arXiv Detail & Related papers (2024-08-28T18:25:35Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- Self-Contradictory Reasoning Evaluation and Detection [31.452161594896978]
We investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers.
We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense.
We find that GPT-4 can detect Self-Contra with a 52.2% F1 score, much lower than the 66.7% achieved by humans.
arXiv Detail & Related papers (2023-11-16T06:22:17Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain-of-thought (CoT) prompting require multi-step prompting and multi-token prediction, which makes them highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)