Probing for Arithmetic Errors in Language Models
- URL: http://arxiv.org/abs/2507.12379v1
- Date: Wed, 16 Jul 2025 16:27:50 GMT
- Title: Probing for Arithmetic Errors in Language Models
- Authors: Yucheng Sun, Alessandro Stolfo, Mrinmaya Sachan
- Abstract summary: Internal activations in language models can be used to detect arithmetic errors. We show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. We train lightweight error detectors that predict model correctness with over 90% accuracy.
- Score: 86.8227317662622
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states, regardless of whether the model's output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
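Below is a minimal, hypothetical sketch of the probing setup the abstract describes: a lightweight linear probe trained on hidden-state activations to predict whether the model's arithmetic answer is correct. The placeholder arrays, layer choice, and variable names (`hidden_states`, `is_correct`) are illustrative assumptions, not the authors' released code or data.

```python
# Minimal sketch (not the authors' implementation): a linear "error detector"
# probe over hidden activations, in the spirit of the paper's lightweight probes.
# Assumes activations were already collected from a model answering 3-digit
# addition prompts, one vector per example, with a correctness label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Placeholder data -- replace with real probed-layer activations and labels.
# hidden_states: (n_examples, hidden_dim) activation vectors
# is_correct:    (n_examples,) 1 if the model's answer matched the true sum
n_examples, hidden_dim = 2000, 512
hidden_states = rng.normal(size=(n_examples, hidden_dim)).astype(np.float32)
is_correct = rng.integers(0, 2, size=n_examples)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, is_correct, test_size=0.2, random_state=0
)

# Simple logistic-regression probe predicting model correctness.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"error-detector accuracy: {accuracy_score(y_test, probe.predict(X_test)):.3f}")

# Selective re-prompting idea: flag examples the probe judges likely wrong
# (low predicted probability of correctness) and regenerate only those steps.
flag_for_reprompt = probe.predict_proba(X_test)[:, 1] < 0.5
```

On real activations (rather than the random placeholders above), the paper reports that such probes detect model correctness with over 90% accuracy; this snippet is only meant to show the shape of the pipeline.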
Related papers
- Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput [21.59519440154879]
We show that an outcome reward model (ORM) plays a crucial role in scaling verification by trading accuracy for speed. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions.
arXiv Detail & Related papers (2025-06-11T17:58:21Z)
- Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers [1.8874331450711404]
Existing work has shown limited success in probing numeric values from models' representations. We propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy. We find that the preciseness of the embeddings, as judged by our probe's accuracy, explains a large portion of LMs' errors in elementary arithmetic.
arXiv Detail & Related papers (2025-06-10T16:37:35Z)
- The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It [23.803612556616685]
We present a mechanistic analysis of error detection in large language models (LLMs). Through circuit analysis, we identify the computational subgraphs responsible for detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal that all models rely heavily on "consistency heads", attention heads that assess surface-level alignment of numerical values in arithmetic solutions.
arXiv Detail & Related papers (2025-02-17T13:00:44Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens in reasoning steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE: preference learning on Qwen2-7B-Instruct yields notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
- Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [47.753284211200665]
We focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage.
This data consists of erroneous solution steps immediately followed by their corrections.
We show promising results: this type of pretraining data can help language models achieve higher reasoning accuracy.
arXiv Detail & Related papers (2024-08-29T06:49:20Z)
- Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models [5.463333911506443]
We aim to enhance the self-checking capabilities of large language models (LLMs) by constructing training data for checking tasks.
We propose a specialized checking format called "Step CoT Check".
Experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs.
arXiv Detail & Related papers (2024-02-20T14:23:23Z)
- Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- Label-Descriptive Patterns and their Application to Characterizing Classification Errors [31.272875287136426]
State-of-the-art deep learning methods achieve human-like performance on many tasks, but make errors nevertheless.
Characterizing these errors in easily interpretable terms not only gives insight into whether a model is prone to making systematic errors, but also provides a way to act on and improve the model.
In this paper, we propose a method that does so for arbitrary classifiers by mining a small set of patterns that together succinctly describe the input data partitioned according to prediction correctness.
arXiv Detail & Related papers (2021-10-18T19:42:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.