Concept-SAE: Active Causal Probing of Visual Model Behavior
- URL: http://arxiv.org/abs/2509.22015v1
- Date: Fri, 26 Sep 2025 07:51:03 GMT
- Title: Concept-SAE: Active Causal Probing of Visual Model Behavior
- Authors: Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu
- Abstract summary: Concept-SAE is a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers.
- Score: 10.346577706023139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
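No code accompanies this listing, but the abstract's two ingredients, dual-supervised concept tokens and direct intervention, map onto a small amount of PyTorch. The sketch below is an illustrative guess at that structure, not the authors' implementation; all names, loss weights, and the choice to supervise the first latent units are assumptions.

```python
# Illustrative sketch, not the authors' code: an SAE whose first n_concepts
# latent units are supervised against binary concept labels ("dual
# supervision"), plus a direct intervention that rescales one concept unit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, n_concepts: int):
        super().__init__()
        self.n_concepts = n_concepts
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h):                 # h: activations from the vision model
        a = self.encoder(h)               # pre-activation, doubles as concept logits
        z = F.relu(a)                     # sparse, non-negative latent code
        return a, z, self.decoder(z)

def concept_sae_loss(model, h, concepts, l1=1e-3, sup=1.0):
    """Reconstruction + sparsity + concept supervision on designated units."""
    a, z, h_hat = model(h)
    recon = F.mse_loss(h_hat, h)
    sparsity = z.abs().mean()
    supervised = F.binary_cross_entropy_with_logits(
        a[:, :model.n_concepts], concepts.float())
    return recon + l1 * sparsity + sup * supervised

@torch.no_grad()
def intervene(model, h, concept_idx, scale=0.0):
    """Causal probe: ablate (scale=0) or amplify one concept unit, then decode
    a counterfactual activation to patch back into the host model."""
    _, z, _ = model(h)
    z[:, concept_idx] = z[:, concept_idx] * scale
    return model.decoder(z)
```

The intervention helper is what turns the SAE from an observational lens into a causal probe: decoding an edited code yields a counterfactual activation whose downstream effect on predictions can be measured.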
Related papers
- Unsupervised Synthetic Image Attribution: Alignment and Disentanglement [55.853285140682665]
We propose a simple yet effective unsupervised method called Alignment and Disentanglement. Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning. Next, we enhance the model's attribution ability by promoting representation disentanglement with the Infomax loss.
arXiv Detail & Related papers (2026-01-30T07:31:53Z)
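For the contrastive alignment step in the entry above, a standard InfoNCE loss is the usual starting point; a minimal sketch follows. The paper's exact formulation, including its Infomax disentanglement term, is not reproduced here.

```python
# Schematic InfoNCE contrastive loss of the kind the abstract alludes to.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Align two views of the same image; z1, z2: (batch, dim) embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)      # positives on the diagonal
```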
- CASL: Concept-Aligned Sparse Latents for Interpreting Diffusion Models [45.90361318326864]
Internal activations of diffusion models encode rich semantic information, but interpreting such representations remains challenging. We introduce CASL (Concept-Aligned Sparse Latents), a supervised framework that aligns sparse latent dimensions of diffusion models with semantic concepts. Unlike editing methods, CASL-Steer is used solely as a causal probe to reveal how concept-aligned latents influence generated content.
arXiv Detail & Related papers (2026-01-21T20:14:17Z)
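The entry above uses steering purely as a causal probe. A common way to implement such a probe, offered here only as a hedged sketch rather than CASL's actual mechanism, is a forward hook that adds a scaled concept direction to an internal activation:

```python
# Illustrative guess, not CASL's API: steer a concept-aligned latent direction
# inside a network via a PyTorch forward hook.
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Add a scaled concept direction to a module's output activation."""
    def hook(module, inputs, output):
        return output + strength * direction.to(output.device)
    return hook

# Usage sketch: `layer` is some internal module of a diffusion U-Net.
# handle = layer.register_forward_hook(make_steering_hook(concept_dir, 2.0))
# ... run generation and observe how the concept shows up in the output ...
# handle.remove()
```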
- Enhancing Interpretability for Vision Models via Shapley Value Optimization [10.809438356590988]
Self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. We propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training. Our method achieves state-of-the-art interpretability.
arXiv Detail & Related papers (2025-12-16T12:33:04Z)
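A permutation-based Monte Carlo estimator is the textbook way to approximate Shapley values and could serve as the auxiliary target the entry above describes; the sketch below assumes a scoring callable `f` and is not the paper's training code.

```python
# Monte Carlo Shapley estimation over input features (e.g., image patches).
import torch

@torch.no_grad()
def mc_shapley(f, n_features, n_samples=64):
    """f(mask) -> scalar model score with features kept where mask == 1."""
    values = torch.zeros(n_features)
    for _ in range(n_samples):
        perm = torch.randperm(n_features)     # random coalition ordering
        mask = torch.zeros(n_features)
        prev = f(mask)                        # empty-coalition baseline score
        for i in perm:
            mask[i] = 1.0                     # add feature i to the coalition
            cur = f(mask)
            values[i] += cur - prev           # marginal contribution of i
            prev = cur
    return values / n_samples                 # averaged over sampled orderings
```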
- FACE: Faithful Automatic Concept Extraction [4.417419748257645]
FACE (Faithful Automatic Concept Extraction) is a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model's original and concept-based predictions. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space.
arXiv Detail & Related papers (2025-10-13T17:44:45Z)
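The FACE objective, as summarized above, combines an NMF reconstruction term with a KL term tying concept-based predictions to the original ones. A schematic version, with an assumed `concept_head` classifier mapping concept scores to class logits, might look like:

```python
# Sketch of an NMF-plus-KL objective in the spirit of FACE; variable names and
# the classifier head are assumptions, not the paper's code. A: (n, d)
# activations, W: (n, k) concept scores, H: (k, d) concept dictionary,
# with W and H assumed non-negative.
import torch
import torch.nn.functional as F

def face_objective(A, W, H, logits_orig, concept_head, kl_weight=1.0):
    recon = ((A - W @ H) ** 2).sum()               # NMF reconstruction error
    p_orig = F.log_softmax(logits_orig, dim=-1)    # original predictions
    p_conc = F.log_softmax(concept_head(W), dim=-1)  # concept-based predictions
    kl = F.kl_div(p_conc, p_orig, log_target=True, reduction="batchmean")
    return recon + kl_weight * kl
```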
- The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning [2.0800882594868293]
Unified Cognitive Consciousness Theory (UCCT) casts language models as vast unconscious pattern repositories. UCCT formalizes semantic anchoring as Bayesian competition between statistical priors learned in pre-training and context-driven target patterns. We ground the theory in three principles: threshold crossing, modality, and density-distance predictive power.
arXiv Detail & Related papers (2025-06-02T18:12:43Z)
- Vision Foundation Model Embedding-Based Semantic Anomaly Detection [12.940376547110509]
This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant.
arXiv Detail & Related papers (2025-05-12T19:00:29Z)
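The nominal-database comparison above reduces, in its simplest form, to k-nearest-neighbor distances in embedding space. A minimal NumPy sketch, where the metric and threshold are assumptions rather than the paper's choices:

```python
# Score runtime embeddings by mean distance to their k nearest nominal
# embeddings; higher score = more anomalous.
import numpy as np

def anomaly_scores(runtime_emb, nominal_db, k=5):
    """runtime_emb: (n, d) patch embeddings; nominal_db: (m, d) database."""
    r = runtime_emb / np.linalg.norm(runtime_emb, axis=1, keepdims=True)
    n = nominal_db / np.linalg.norm(nominal_db, axis=1, keepdims=True)
    dist = 1.0 - r @ n.T                       # cosine distance, shape (n, m)
    knn = np.sort(dist, axis=1)[:, :k]         # k nearest nominal neighbors
    return knn.mean(axis=1)

# flags = anomaly_scores(emb, db) > 0.3       # example threshold, tuned per system
```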
- Leakage and Interpretability in Concept-Based Models [0.24466725954625887]
Concept Bottleneck Models aim to improve interpretability by predicting high-level intermediate concepts. They are known to suffer from information leakage, whereby models exploit unintended information encoded within the learned concepts. We introduce an information-theoretic framework to rigorously characterise and quantify leakage.
arXiv Detail & Related papers (2025-04-18T22:21:06Z)
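One simple way to operationalize leakage measurement, not necessarily the paper's estimator, is to compare how well a probe predicts the label from predicted (soft) concepts versus from ground-truth concepts; any surplus accuracy reflects information leaked past the concept bottleneck.

```python
# Probe-based leakage gap; a rough proxy, not the paper's information-theoretic
# estimator. For a real estimate, fit and score on separate data splits.
from sklearn.linear_model import LogisticRegression

def leakage_gap(soft_concepts, true_concepts, labels):
    """Accuracy surplus of a label probe on soft vs. ground-truth concepts."""
    soft_probe = LogisticRegression(max_iter=1000).fit(soft_concepts, labels)
    hard_probe = LogisticRegression(max_iter=1000).fit(true_concepts, labels)
    return (soft_probe.score(soft_concepts, labels)
            - hard_probe.score(true_concepts, labels))
```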
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to enable robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z)
- Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization [2.163881720692685]
We introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model. We evaluate CLs across multiple tasks, demonstrating that they maintain the original model's performance and agreement while enabling meaningful interventions.
arXiv Detail & Related papers (2025-02-19T11:10:19Z)
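The project-then-reconstruct mechanism described above admits a very small sketch; the linear maps and the `edit` hook are illustrative assumptions, not the paper's architecture.

```python
# Minimal concept-layer sketch: hidden state -> concept scores -> (optional
# intervention) -> reconstructed hidden state fed back into the model.
import torch
import torch.nn as nn

class ConceptLayer(nn.Module):
    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        self.to_concepts = nn.Linear(d_model, n_concepts)    # interpretable scores
        self.from_concepts = nn.Linear(n_concepts, d_model)  # back-projection

    def forward(self, h, edit=None):
        c = self.to_concepts(h)            # explainable concept vector
        if edit is not None:
            c = edit(c)                    # intervention hook, e.g. zero a concept
        return self.from_concepts(c)       # reconstructed state continues forward
```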
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
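A minimal version of the perturbation probe above: generate from locally perturbed prompt embeddings and measure how stable the outputs are. Here `generate` and `embed` are placeholder callables standing in for the paper's actual setup.

```python
# Stability of outputs under local embedding perturbations; low mean similarity
# suggests an unreliable (or tampered) model region. All names are assumptions.
import torch

@torch.no_grad()
def reliability_score(generate, embed, prompt_emb, sigma=0.05, n=8):
    """generate: embedding -> image; embed: image -> feature vector."""
    ref = embed(generate(prompt_emb))                  # unperturbed reference
    sims = []
    for _ in range(n):
        noisy = prompt_emb + sigma * torch.randn_like(prompt_emb)
        sims.append(torch.cosine_similarity(ref, embed(generate(noisy)), dim=-1))
    return torch.stack(sims).mean()
```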
- Predictive Churn with the Set of Good Models [61.00058053669447]
This paper explores connections between two seemingly unrelated concepts of predictive inconsistency. The first, known as predictive multiplicity, occurs when models that perform similarly produce conflicting predictions for individual samples. The second concept, predictive churn, examines the differences in individual predictions before and after model updates.
arXiv Detail & Related papers (2024-02-12T16:15:25Z)
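Predictive churn itself has a direct definition, the fraction of samples whose predicted label flips across a model update, which fits in a few lines:

```python
# Churn between a model and its update, computed over a fixed evaluation set.
import numpy as np

def churn(preds_old: np.ndarray, preds_new: np.ndarray) -> float:
    """Fraction of samples where the two models' predicted labels disagree."""
    return float(np.mean(preds_old != preds_new))
```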
- JAB: Joint Adversarial Prompting and Belief Augmentation [81.39548637776365]
We introduce a joint framework in which we probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation.
This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes.
arXiv Detail & Related papers (2023-11-16T00:35:54Z)
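The probe-and-harden loop above can be sketched schematically; every callable here (`red_team`, `target`, `belief_augment`, `is_unsafe`) is hypothetical, standing in for components the abstract only names.

```python
# Purely schematic red-teaming loop in the spirit of the JAB description.
def jab_round(red_team, target, belief_augment, is_unsafe, seed_prompts, beliefs):
    for prompt in seed_prompts:
        attack = red_team(prompt, beliefs)      # adversarial probe of the target
        reply = target(attack, beliefs)         # target answers under current beliefs
        if is_unsafe(reply):
            # Belief augmenter turns the failure into a defensive instruction.
            beliefs.append(belief_augment(attack, reply))
    return beliefs                              # hardened beliefs for the next round
```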
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe into prompts that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)