The Geometry of Self-Verification in a Task-Specific Reasoning Model
- URL: http://arxiv.org/abs/2504.14379v1
- Date: Sat, 19 Apr 2025 18:40:51 GMT
- Title: The Geometry of Self-Verification in a Task-Specific Reasoning Model
- Authors: Andrew Lee, Lihao Sun, Chris Wendler, Fernanda Viégas, Martin Wattenberg
- Abstract summary: We train a model using DeepSeek R1's recipe on the CountDown task. We do a top-down and bottom-up analysis to reverse-engineer how the model verifies its outputs.
- Score: 45.669264589017665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How do reasoning models verify their own answers? We study this question by training a model using DeepSeek R1's recipe on the CountDown task. We leverage the fact that preference tuning leads to mode collapse, resulting in a model that always produces highly structured and easily parse-able chain-of-thought sequences. With this setup, we do a top-down and bottom-up analysis to reverse-engineer how the model verifies its outputs. Our top-down analysis reveals Gated Linear Unit (GLU) weights encoding verification-related tokens, such as "success" or "incorrect", which activate according to the correctness of the model's reasoning steps. Our bottom-up analysis reveals that "previous-token heads" are mainly responsible for model verification. Our analyses meet in the middle: drawing inspiration from inter-layer communication channels, we use the identified GLU vectors to localize as few as three attention heads that can disable model verification, pointing to a necessary component of a potentially larger verification circuit.
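As a rough, non-authoritative illustration of the top-down GLU analysis described in the abstract, the sketch below scores GLU down-projection units by how strongly they promote verification-related tokens when projected through the unembedding matrix (a logit-lens-style check). All shapes, weights, and token ids are stand-in assumptions, not the authors' code or model; the paper's follow-on step of relating GLU vectors to specific attention heads is not sketched here.

```python
# Hedged sketch of a logit-lens-style check on GLU down-projection weights.
# Shapes, weights, and token ids below are illustrative placeholders.
import torch

d_model, d_mlp, vocab = 2048, 8192, 32000
torch.manual_seed(0)

# Stand-ins for real checkpoint weights (load these from the actual model):
W_down = torch.randn(d_model, d_mlp)   # GLU/MLP down-projection (out_features, in_features)
W_U = torch.randn(vocab, d_model)      # unembedding / lm_head weight

# Each column of W_down is what one GLU unit writes into the residual stream;
# projecting it through the unembedding shows which tokens that unit promotes.
unit_logits = W_U @ W_down             # (vocab, d_mlp)

# Hypothetical ids for verification-related tokens such as " success" / " incorrect".
verification_token_ids = torch.tensor([1234, 5678])

# Rank GLU units by how strongly they promote any verification token,
# keeping the top candidates for closer causal inspection (e.g., ablation).
scores = unit_logits[verification_token_ids].max(dim=0).values   # (d_mlp,)
top_units = scores.topk(10).indices
print("candidate verification GLU units:", top_units.tolist())
```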
Related papers
- CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding [55.33317649771575]
Embodied Reference Understanding involves predicting the object that a person in the scene is referring to through both pointing gesture and language. We propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features.
arXiv Detail & Related papers (2025-07-29T15:00:21Z) - Adversarial Manipulation of Reasoning Models using Internal Representations [1.024113475677323]
We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply. We show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates. Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models.
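As a small, hedged illustration of the idea summarized above (placeholder layer, dimensions, and data rather than the paper's setup), a refuse-vs-comply direction can be estimated as a difference of means over cached CoT activations and then reused both as a probe and as an additive intervention:

```python
# Hedged sketch (placeholders, not the paper's code): estimate a refuse-vs-comply
# direction from cached CoT activations and use it as a probe and an intervention.
import torch

d_model = 2048
torch.manual_seed(0)

# Stand-ins for activations cached at one layer during CoT generation:
acts_refuse = torch.randn(128, d_model) + 0.5   # runs that ended in refusal
acts_comply = torch.randn(128, d_model) - 0.5   # runs that ended in compliance

# Difference-of-means direction, normalized to unit length.
direction = acts_refuse.mean(0) - acts_comply.mean(0)
direction = direction / direction.norm()

def refusal_score(h: torch.Tensor) -> torch.Tensor:
    """Probe: projection of a CoT activation onto the direction."""
    return h @ direction

def steer(h: torch.Tensor, alpha: float = -4.0) -> torch.Tensor:
    """Intervention: shift a CoT activation along the direction
    (negative alpha pushes it away from refusal)."""
    return h + alpha * direction

h = torch.randn(d_model)
print(float(refusal_score(h)), float(refusal_score(steer(h))))
```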
arXiv Detail & Related papers (2025-07-03T20:51:32Z) - Incentivizing LLMs to Self-Verify Their Answers [20.2584779107763]
Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks. We propose a framework that incentivizes LLMs to self-verify their own answers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B.
arXiv Detail & Related papers (2025-06-02T06:54:29Z) - Mitigating Deceptive Alignment via Self-Monitoring [15.365589693661823]
We develop a framework that embeds a Self-Monitor inside the chain-of-thought process itself, named CoT Monitor+. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals.
arXiv Detail & Related papers (2025-05-24T17:41:47Z) - Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data. We identify a critical parameter threshold (1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z) - STRIVE: Structured Reasoning for Self-Improvement in Claim Verification [21.00145637520767]
We propose STRIVE: Structured Reasoning for Self-Improved Verification.
Our method introduces a structured reasoning design with Claim Decomposition, Entity Analysis, and Evidence Grounding Verification.
It is then applied to generate reasoning chains for all training examples, selecting only those that are correct and structurally sound for subsequent self-improvement training.
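A minimal sketch of the selection step described above, assuming hypothetical fields (`chain`, `predicted_label`, `gold_label`) and reusing the three section names from the summary; this is illustrative, not STRIVE's implementation:

```python
# Illustrative filter (not STRIVE's code): keep only generated reasoning chains
# that reach the correct verdict AND contain the expected structured sections.
from typing import Dict, List

REQUIRED_SECTIONS = ["Claim Decomposition", "Entity Analysis", "Evidence Grounding Verification"]

def is_structurally_sound(chain: str) -> bool:
    # A chain counts as structurally sound here if every section header appears.
    return all(section in chain for section in REQUIRED_SECTIONS)

def select_training_chains(examples: List[Dict]) -> List[Dict]:
    # 'chain', 'predicted_label', 'gold_label' are hypothetical field names.
    return [ex for ex in examples
            if ex["predicted_label"] == ex["gold_label"]
            and is_structurally_sound(ex["chain"])]
```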
arXiv Detail & Related papers (2025-02-17T16:07:07Z) - Attuned to Change: Causal Fine-Tuning under Latent-Confounded Shifts [32.989526411946606]
Adapting to latent-confounded shifts remains a core challenge in modern AI. One practical failure mode arises when fine-tuning pre-trained foundation models on confounded data. We frame causal fine-tuning as an identification problem and pose an explicit causal model that decomposes inputs into low-level spurious features.
arXiv Detail & Related papers (2024-10-18T11:06:23Z) - AutoPSV: Automated Process-Supervised Verifier [10.283965168399158]
We propose the Automated Process-Supervised Verifier (AutoPSV).
AutoPSV begins by training a verification model on the correctness of final answers.
We experimentally validate that the step-level confidence changes learned by the verification model trained on the final answer correctness can effectively identify errors in the reasoning steps.
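As a loose sketch of that step-level signal (with a toy stand-in verifier rather than a model trained on final-answer correctness), one can score each growing reasoning prefix and flag steps where the verifier's confidence drops sharply:

```python
# Loose sketch (toy verifier, not AutoPSV itself): flag reasoning steps where the
# verifier's confidence in the final answer drops sharply relative to the
# preceding prefix.
from typing import Callable, List

def flag_suspect_steps(steps: List[str],
                       verifier_confidence: Callable[[str], float],
                       drop_threshold: float = 0.2) -> List[int]:
    flagged, prev_conf, prefix = [], None, ""
    for i, step in enumerate(steps):
        prefix += step + "\n"
        conf = verifier_confidence(prefix)  # confidence the final answer will be correct
        if prev_conf is not None and prev_conf - conf > drop_threshold:
            flagged.append(i)
        prev_conf = conf
    return flagged

# Toy stand-in verifier: confidence collapses once a wrong arithmetic step appears.
toy_conf = lambda text: 0.3 if "24 - 7 = 18" in text else 0.9
print(flag_suspect_steps(["6 * 4 = 24", "24 - 7 = 18", "18 + 1 = 19"], toy_conf))  # -> [1]
```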
arXiv Detail & Related papers (2024-05-27T03:44:24Z) - Explaining Pre-Trained Language Models with Attribution Scores: An Analysis in Low-Resource Settings [32.03184402316848]
We analyze attribution scores extracted from prompt-based models w.r.t. plausibility and faithfulness.
We find that using the prompting paradigm yields more plausible explanations than fine-tuning the models in low-resource settings.
arXiv Detail & Related papers (2024-03-08T14:14:37Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - GrOVe: Ownership Verification of Graph Neural Networks using Embeddings [13.28269672097063]
Graph neural networks (GNNs) have emerged as a state-of-the-art approach to model and draw inferences from large scale graph-structured data.
Prior work has shown that GNNs are prone to model extraction attacks.
We present GrOVe, a state-of-the-art GNN model fingerprinting scheme.
arXiv Detail & Related papers (2023-04-17T19:06:56Z) - Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z) - Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps [85.49020931411825]
Convolutional Neural Networks (CNNs) compression is crucial to deploying these models in edge devices with limited resources.
We propose to address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process.
We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models.
arXiv Detail & Related papers (2022-09-07T01:12:11Z) - Certifiable 3D Object Pose Estimation: Foundations, Learning Models, and Self-Training [23.802602957611676]
We consider a certifiable object pose estimation problem, where -- given a partial point cloud of an object -- the goal is to provide a certificate of correctness for the resulting estimate.
We propose C-3PO, a semantic-keypoint-based pose estimation model, augmented with the two certificates.
arXiv Detail & Related papers (2022-06-22T17:06:39Z) - Entailment Tree Explanations via Iterative Retrieval-Generation Reasoner [56.08919422452905]
We propose an architecture called Iterative Retrieval-Generation Reasoner (IRGR).
Our model is able to explain a given hypothesis by systematically generating a step-by-step explanation from textual premises.
We outperform existing benchmarks on premise retrieval and entailment tree generation, with around 300% gain in overall correctness.
arXiv Detail & Related papers (2022-05-18T21:52:11Z) - Probing Model Signal-Awareness via Prediction-Preserving Input Minimization [67.62847721118142]
We evaluate models' ability to capture the correct vulnerability signals to produce their predictions.
We measure the signal awareness of models using a new metric we propose: Signal-aware Recall (SAR).
The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric.
arXiv Detail & Related papers (2020-11-25T20:05:23Z) - Explaining and Improving Model Behavior with k Nearest Neighbor Representations [107.24850861390196]
We propose using k nearest neighbor representations to identify training examples responsible for a model's predictions.
We show that kNN representations are effective at uncovering learned spurious associations.
Our results indicate that the kNN approach makes the finetuned model more robust to adversarial inputs.
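A toy version of this retrieval idea, using random placeholder representations instead of a finetuned encoder: find the training examples whose representations are nearest to a test input and inspect their labels.

```python
# Toy sketch (random placeholders, not the paper's setup): nearest training
# representations for a test input, by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for representations from a finetuned encoder:
train_reps = rng.normal(size=(1000, 768)).astype(np.float32)
train_labels = rng.integers(0, 2, size=1000)
test_rep = rng.normal(size=768).astype(np.float32)

def knn_neighbors(query, bank, k=5):
    # Cosine similarity between the query and every training representation.
    bank_n = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    sims = bank_n @ query_n
    top = np.argsort(-sims)[:k]
    return top, sims[top]

idx, sims = knn_neighbors(test_rep, train_reps)
print("nearest training examples:", idx.tolist())
print("their labels:", train_labels[idx].tolist())
```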
arXiv Detail & Related papers (2020-10-18T16:55:25Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
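For intuition about the set-based loss mentioned above, the snippet below runs the unique bipartite (Hungarian) matching step on a placeholder cost matrix; it is a generic sketch, not the DETR implementation.

```python
# Generic sketch (placeholder costs, not the DETR code): Hungarian matching
# assigns each ground-truth object to a unique prediction before the loss.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_queries, n_targets = 5, 3

# Placeholder pairwise matching costs (in DETR: classification + box terms).
cost = rng.uniform(size=(n_queries, n_targets))

pred_idx, tgt_idx = linear_sum_assignment(cost)  # one-to-one, minimum-cost assignment
print(list(zip(pred_idx.tolist(), tgt_idx.tolist())))
```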
arXiv Detail & Related papers (2020-05-26T17:06:38Z) - Robust Question Answering Through Sub-part Alignment [53.94003466761305]
We model question answering as an alignment problem.
We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets.
arXiv Detail & Related papers (2020-04-30T09:10:57Z)