Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs
- URL: http://arxiv.org/abs/2510.09794v1
- Date: Fri, 10 Oct 2025 18:59:03 GMT
- Title: Causality $\neq$ Decodability, and Vice Versa: Lessons from Interpreting Counting ViTs
- Authors: Lianghuan Huang, Yingshan Chang
- Abstract summary: We investigate the relationship between decodability and causality in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens. We train linear probes to assess the decodability of count information at different depths.
- Score: 6.622603488436762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mechanistic interpretability seeks to uncover how internal components of neural networks give rise to predictions. A persistent challenge, however, is disentangling two often conflated notions: decodability--the recoverability of information from hidden states--and causality--the extent to which those states functionally influence outputs. In this work, we investigate their relationship in vision transformers (ViTs) fine-tuned for object counting. Using activation patching, we test the causal role of spatial and CLS tokens by transplanting activations across clean-corrupted image pairs. In parallel, we train linear probes to assess the decodability of count information at different depths. Our results reveal systematic mismatches: middle-layer object tokens exert strong causal influence despite being weakly decodable, whereas final-layer object tokens support accurate decoding yet are functionally inert. Similarly, the CLS token becomes decodable in mid-layers but only acquires causal power in the final layers. These findings highlight that decodability and causality reflect complementary dimensions of representation--what information is present versus what is used--and that their divergence can expose hidden computational circuits.
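The abstract's two analyses can be illustrated with a minimal numpy sketch. A toy two-layer network stands in for the ViT; the weights, sizes, and the linearly encoded "count" target are illustrative assumptions, not the paper's actual setup. Transplanting a cached clean activation into a corrupted run measures causality; fitting a linear readout from hidden states measures decodability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a ViT: two layers of hidden states, then a scalar readout.
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(8, 8))
w_out = rng.normal(size=8)

def forward(x, patch=None):
    """Run the toy model; patch=(layer_idx, activation) overwrites one layer's
    hidden state, mimicking activation patching across clean/corrupted inputs."""
    acts, h = [], x
    for i, W in enumerate((W1, W2)):
        h = np.tanh(W @ h)
        if patch is not None and patch[0] == i:
            h = patch[1]  # transplant the cached activation
        acts.append(h)
    return float(w_out @ h), acts

clean_x, corrupt_x = rng.normal(size=4), rng.normal(size=4)
clean_out, clean_acts = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Causality: transplant the clean layer-0 activation into the corrupted run.
# If that layer carries the causal signal, the output moves back toward clean_out.
patched_out, _ = forward(corrupt_x, patch=(0, clean_acts[0]))
recovery = abs(patched_out - corrupt_out) / (abs(clean_out - corrupt_out) + 1e-12)

# Decodability: fit a linear probe from final hidden states to a target that is
# (by construction) linearly encoded there; a near-perfect fit means the
# information is present, regardless of whether the model uses it.
H = np.stack([forward(rng.normal(size=4))[1][-1] for _ in range(100)])
target = H @ rng.normal(size=8)
probe, *_ = np.linalg.lstsq(H, target, rcond=None)
```

In this toy, patching layer 0 fully determines everything downstream, so the recovery ratio is 1; in a real ViT it would be fractional, and comparing it layer by layer against probe accuracy is what exposes the causality/decodability mismatches the paper reports.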
Related papers
- Imagination Helps Visual Reasoning, But Not Yet in Latent Space [65.80396132375571]
We investigate the validity of latent reasoning using Causal Mediation Analysis. We show that latent tokens encode limited visual information and exhibit high similarity. We propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text.
arXiv Detail & Related papers (2026-02-26T08:56:23Z) - Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM [0.0]
Prior work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. We investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid.
arXiv Detail & Related papers (2026-02-22T12:42:38Z) - Do Latent Tokens Think? A Causal and Adversarial Analysis of Chain-of-Continuous-Thought [16.907732581097417]
We focus on Chain-of-Continuous-Thought (COCONUT), which claims better efficiency and stability than explicit Chain-of-Thought (CoT). Unlike CoT tokens, COCONUT tokens show minimal sensitivity to steering and lack reasoning-critical information. Results on MMLU and HotpotQA demonstrate that COCONUT consistently exploits dataset artifacts, inflating benchmark performance without true reasoning.
arXiv Detail & Related papers (2025-12-25T15:14:53Z) - Unleashing Perception-Time Scaling to Multimodal Reasoning Models [60.578179197783754]
Recent advances in inference-time scaling have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. We propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems.
arXiv Detail & Related papers (2025-10-10T03:17:52Z) - Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis [3.1526281887627587]
Distinguishing recall from reasoning is crucial for predicting model generalization. We use controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models.
arXiv Detail & Related papers (2025-10-03T04:13:06Z) - Latent Chain-of-Thought? Decoding the Depth-Recurrent Transformer [0.8738725605667471]
Chain-of-thought (CoT) reasoning has enabled transformer-based language models to excel at complex mathematics and multi-step planning. In standard decoder-only architectures, these reasoning steps are externalized in natural language, improving interpretability at the cost of efficiency. We investigate whether such reasoning structures emerge in Huginn-3.5B, a depth-recurrent Transformer that reuses layers at inference time without increasing parameter count.
arXiv Detail & Related papers (2025-07-02T23:35:21Z) - Mechanistic Interpretability in the Presence of Architectural Obfuscation [0.0]
Architectural obfuscation is a lightweight substitute for heavyweight cryptography in privacy-preserving large-language-model (LLM) inference. We analyze a GPT-2-small model trained from scratch with a representative obfuscation map. Our findings reveal that obfuscation dramatically alters activation patterns within attention heads yet preserves the layer-wise computational graph.
arXiv Detail & Related papers (2025-06-22T14:39:16Z) - Counterfactual reasoning: an analysis of in-context emergence [57.118735341305786]
We show that language models are capable of counterfactual reasoning. We find that self-attention, model depth, and pre-training data diversity drive performance. Our findings extend to counterfactual reasoning under SDE dynamics.
arXiv Detail & Related papers (2025-06-05T16:02:07Z) - Concept-Guided Interpretability via Neural Chunking [64.6429903327095]
We show that neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We propose three methods to extract recurring chunks on a neural population level. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data.
arXiv Detail & Related papers (2025-05-16T13:49:43Z) - I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z) - Diagnosing Catastrophe: Large parts of accuracy loss in continual learning can be accounted for by readout misalignment [0.0]
Training artificial neural networks on changing data distributions leads to a rapid decrease in performance on old tasks.
We investigate the representational changes that underlie this performance decrease and identify three distinct processes that together account for the phenomenon.
arXiv Detail & Related papers (2023-10-09T11:57:46Z) - OOD-CV-v2: An extended Benchmark for Robustness to Out-of-Distribution Shifts of Individual Nuisances in Natural Images [59.51657161097337]
OOD-CV-v2 is a benchmark dataset that includes out-of-distribution examples of 10 object categories, varying in pose, shape, texture, context, and weather conditions.
In addition to this novel dataset, we contribute extensive experiments using popular baseline methods.
arXiv Detail & Related papers (2023-04-17T20:39:25Z) - What can we learn about a generated image corrupting its latent representation? [57.1841740328509]
We investigate the hypothesis that we can predict image quality from the latent representation in the GAN's bottleneck.
We achieve this by corrupting the latent representation with noise and generating multiple outputs.
arXiv Detail & Related papers (2022-10-12T14:40:32Z) - Where and What? Examining Interpretable Disentangled Representations [96.32813624341833]
Capturing interpretable variations has long been one of the goals in disentanglement learning.
Unlike the independence assumption, interpretability has rarely been exploited to encourage disentanglement in the unsupervised setting.
In this paper, we examine the interpretability of disentangled representations by investigating two questions: where to be interpreted and what to be interpreted.
arXiv Detail & Related papers (2021-04-07T11:22:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.