Concept-SAE: Active Causal Probing of Visual Model Behavior
- URL: http://arxiv.org/abs/2509.22015v1
- Date: Fri, 26 Sep 2025 07:51:03 GMT
- Title: Concept-SAE: Active Causal Probing of Visual Model Behavior
- Authors: Jianrong Ding, Muxi Chen, Chenchen Zhao, Qiang Xu
- Abstract summary: Concept-SAE is a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers.
- Score: 10.346577706023139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard Sparse Autoencoders (SAEs) excel at discovering a dictionary of a model's learned features, offering a powerful observational lens. However, the ambiguous and ungrounded nature of these features makes them unreliable instruments for the active, causal probing of model behavior. To solve this, we introduce Concept-SAE, a framework that forges semantically grounded concept tokens through a novel hybrid disentanglement strategy. We first quantitatively demonstrate that our dual-supervision approach produces tokens that are remarkably faithful and spatially localized, outperforming alternative methods in disentanglement. This validated fidelity enables two critical applications: (1) we probe the causal link between internal concepts and predictions via direct intervention, and (2) we probe the model's failure modes by systematically localizing adversarial vulnerabilities to specific layers. Concept-SAE provides a validated blueprint for moving beyond correlational interpretation to the mechanistic, causal probing of model behavior.
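No code accompanies this listing, but the abstract's two ingredients, dual-supervised concept tokens and direct intervention, map onto a small amount of PyTorch. The sketch below is an illustrative guess at that structure, not the authors' implementation; all names, loss weights, and the choice to supervise the first latent units are assumptions.

```python
# Illustrative sketch, not the authors' code: an SAE whose first n_concepts
# latent units are supervised against binary concept labels ("dual
# supervision"), plus a direct intervention that rescales one concept unit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, n_concepts: int):
        super().__init__()
        self.n_concepts = n_concepts
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, h):                 # h: activations from the vision model
        a = self.encoder(h)               # pre-activation, doubles as concept logits
        z = F.relu(a)                     # sparse, non-negative latent code
        return a, z, self.decoder(z)

def concept_sae_loss(model, h, concepts, l1=1e-3, sup=1.0):
    """Reconstruction + sparsity + concept supervision on designated units."""
    a, z, h_hat = model(h)
    recon = F.mse_loss(h_hat, h)
    sparsity = z.abs().mean()
    supervised = F.binary_cross_entropy_with_logits(
        a[:, :model.n_concepts], concepts.float())
    return recon + l1 * sparsity + sup * supervised

@torch.no_grad()
def intervene(model, h, concept_idx, scale=0.0):
    """Causal probe: ablate (scale=0) or amplify one concept unit, then decode
    a counterfactual activation to patch back into the host model."""
    _, z, _ = model(h)
    z[:, concept_idx] = z[:, concept_idx] * scale
    return model.decoder(z)
```

The intervention helper is what turns the SAE from an observational lens into a causal probe: decoding an edited code yields a counterfactual activation whose downstream effect on predictions can be measured.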
Related papers
- Unsupervised Synthetic Image Attribution: Alignment and Disentanglement [55.853285140682665]
We propose a simple yet effective unsupervised method called Alignment and Disentanglement. Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning. Next, we enhance the model's attribution ability by promoting representation disentanglement with the Infomax loss.
arXiv Detail & Related papers (2026-01-30T07:31:53Z)
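For the contrastive alignment step in the entry above, a standard InfoNCE loss is the usual starting point; a minimal sketch follows. The paper's exact formulation, including its Infomax disentanglement term, is not reproduced here.

```python
# Schematic InfoNCE contrastive loss of the kind the abstract alludes to.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Align two views of the same image; z1, z2: (batch, dim) embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # pairwise similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)      # positives on the diagonal
```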
- CASL: Concept-Aligned Sparse Latents for Interpreting Diffusion Models [45.90361318326864]
Internal activations of diffusion models encode rich semantic information, but interpreting such representations remains challenging. We introduce CASL (Concept-Aligned Sparse Latents), a supervised framework that aligns sparse latent dimensions of diffusion models with semantic concepts. Unlike editing methods, CASL-Steer is used solely as a causal probe to reveal how concept-aligned latents influence generated content.
arXiv Detail & Related papers (2026-01-21T20:14:17Z)
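The entry above uses steering purely as a causal probe. A common way to implement such a probe, offered here only as a hedged sketch rather than CASL's actual mechanism, is a forward hook that adds a scaled concept direction to an internal activation:

```python
# Illustrative guess, not CASL's API: steer a concept-aligned latent direction
# inside a network via a PyTorch forward hook.
import torch

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Add a scaled concept direction to a module's output activation."""
    def hook(module, inputs, output):
        return output + strength * direction.to(output.device)
    return hook

# Usage sketch: `layer` is some internal module of a diffusion U-Net.
# handle = layer.register_forward_hook(make_steering_hook(concept_dir, 2.0))
# ... run generation and observe how the concept shows up in the output ...
# handle.remove()
```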
- Enhancing Interpretability for Vision Models via Shapley Value Optimization [10.809438356590988]
Self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. We propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training. Our method achieves state-of-the-art interpretability.
arXiv Detail & Related papers (2025-12-16T12:33:04Z)
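A permutation-based Monte Carlo estimator is the textbook way to approximate Shapley values and could serve as the auxiliary target the entry above describes; the sketch below assumes a scoring callable `f` and is not the paper's training code.

```python
# Monte Carlo Shapley estimation over input features (e.g., image patches).
import torch

@torch.no_grad()
def mc_shapley(f, n_features, n_samples=64):
    """f(mask) -> scalar model score with features kept where mask == 1."""
    values = torch.zeros(n_features)
    for _ in range(n_samples):
        perm = torch.randperm(n_features)     # random coalition ordering
        mask = torch.zeros(n_features)
        prev = f(mask)                        # empty-coalition baseline score
        for i in perm:
            mask[i] = 1.0                     # add feature i to the coalition
            cur = f(mask)
            values[i] += cur - prev           # marginal contribution of i
            prev = cur
    return values / n_samples                 # averaged over sampled orderings
```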
- FACE: Faithful Automatic Concept Extraction [4.417419748257645]
FACE (Faithful Automatic Concept Extraction) is a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model's original and concept-based predictions. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space.
arXiv Detail & Related papers (2025-10-13T17:44:45Z)
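The FACE objective, as summarized above, combines an NMF reconstruction term with a KL term tying concept-based predictions to the original ones. A schematic version, with an assumed `concept_head` classifier mapping concept scores to class logits, might look like:

```python
# Sketch of an NMF-plus-KL objective in the spirit of FACE; variable names and
# the classifier head are assumptions, not the paper's code. A: (n, d)
# activations, W: (n, k) concept scores, H: (k, d) concept dictionary,
# with W and H assumed non-negative.
import torch
import torch.nn.functional as F

def face_objective(A, W, H, logits_orig, concept_head, kl_weight=1.0):
    recon = ((A - W @ H) ** 2).sum()               # NMF reconstruction error
    p_orig = F.log_softmax(logits_orig, dim=-1)    # original predictions
    p_conc = F.log_softmax(concept_head(W), dim=-1)  # concept-based predictions
    kl = F.kl_div(p_conc, p_orig, log_target=True, reduction="batchmean")
    return recon + kl_weight * kl
```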
- The Unified Cognitive Consciousness Theory for Language Models: Anchoring Semantics, Thresholds of Activation, and Emergent Reasoning [2.0800882594868293]
Unified Cognitive Consciousness Theory (UCCT) casts language models as vast unconscious pattern repositories. UCCT formalizes semantic anchoring as Bayesian competition between statistical priors learned in pre-training and context-driven target patterns. We ground the theory in three principles: threshold crossing, modality, and density-distance predictive power.
arXiv Detail & Related papers (2025-06-02T18:12:43Z)
- Vision Foundation Model Embedding-Based Semantic Anomaly Detection [12.940376547110509]
This work explores semantic anomaly detection by leveraging the semantic priors of state-of-the-art vision foundation models. We propose a framework that compares local vision embeddings from runtime images to a database of nominal scenarios in which the autonomous system is deemed safe and performant.
arXiv Detail & Related papers (2025-05-12T19:00:29Z)
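The nominal-database comparison above reduces, in its simplest form, to k-nearest-neighbor distances in embedding space. A minimal NumPy sketch, where the metric and threshold are assumptions rather than the paper's choices:

```python
# Score runtime embeddings by mean distance to their k nearest nominal
# embeddings; higher score = more anomalous.
import numpy as np

def anomaly_scores(runtime_emb, nominal_db, k=5):
    """runtime_emb: (n, d) patch embeddings; nominal_db: (m, d) database."""
    r = runtime_emb / np.linalg.norm(runtime_emb, axis=1, keepdims=True)
    n = nominal_db / np.linalg.norm(nominal_db, axis=1, keepdims=True)
    dist = 1.0 - r @ n.T                       # cosine distance, shape (n, m)
    knn = np.sort(dist, axis=1)[:, :k]         # k nearest nominal neighbors
    return knn.mean(axis=1)

# flags = anomaly_scores(emb, db) > 0.3       # example threshold, tuned per system
```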
- Leakage and Interpretability in Concept-Based Models [0.24466725954625887]
Concept Bottleneck Models aim to improve interpretability by predicting high-level intermediate concepts. They are known to suffer from information leakage, whereby models exploit unintended information encoded within the learned concepts. We introduce an information-theoretic framework to rigorously characterise and quantify leakage.
arXiv Detail & Related papers (2025-04-18T22:21:06Z)
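One simple way to operationalize leakage measurement, not necessarily the paper's estimator, is to compare how well a probe predicts the label from predicted (soft) concepts versus from ground-truth concepts; any surplus accuracy reflects information leaked past the concept bottleneck.

```python
# Probe-based leakage gap; a rough proxy, not the paper's information-theoretic
# estimator. For a real estimate, fit and score on separate data splits.
from sklearn.linear_model import LogisticRegression

def leakage_gap(soft_concepts, true_concepts, labels):
    """Accuracy surplus of a label probe on soft vs. ground-truth concepts."""
    soft_probe = LogisticRegression(max_iter=1000).fit(soft_concepts, labels)
    hard_probe = LogisticRegression(max_iter=1000).fit(true_concepts, labels)
    return (soft_probe.score(soft_concepts, labels)
            - hard_probe.score(true_concepts, labels))
```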
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to enable robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z)
- Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization [2.163881720692685]
We introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model. We evaluate CLs across multiple tasks, demonstrating that they maintain the original model's performance and agreement while enabling meaningful interventions.
arXiv Detail & Related papers (2025-02-19T11:10:19Z)
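The project-then-reconstruct mechanism described above admits a very small sketch; the linear maps and the `edit` hook are illustrative assumptions, not the paper's architecture.

```python
# Minimal concept-layer sketch: hidden state -> concept scores -> (optional
# intervention) -> reconstructed hidden state fed back into the model.
import torch
import torch.nn as nn

class ConceptLayer(nn.Module):
    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        self.to_concepts = nn.Linear(d_model, n_concepts)    # interpretable scores
        self.from_concepts = nn.Linear(n_concepts, d_model)  # back-projection

    def forward(self, h, edit=None):
        c = self.to_concepts(h)            # explainable concept vector
        if edit is not None:
            c = edit(c)                    # intervention hook, e.g. zero a concept
        return self.from_concepts(c)       # reconstructed state continues forward
```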
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
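A minimal version of the perturbation probe above: generate from locally perturbed prompt embeddings and measure how stable the outputs are. Here `generate` and `embed` are placeholder callables standing in for the paper's actual setup.

```python
# Stability of outputs under local embedding perturbations; low mean similarity
# suggests an unreliable (or tampered) model region. All names are assumptions.
import torch

@torch.no_grad()
def reliability_score(generate, embed, prompt_emb, sigma=0.05, n=8):
    """generate: embedding -> image; embed: image -> feature vector."""
    ref = embed(generate(prompt_emb))                  # unperturbed reference
    sims = []
    for _ in range(n):
        noisy = prompt_emb + sigma * torch.randn_like(prompt_emb)
        sims.append(torch.cosine_similarity(ref, embed(generate(noisy)), dim=-1))
    return torch.stack(sims).mean()
```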
- Predictive Churn with the Set of Good Models [61.00058053669447]
This paper explores connections between two seemingly unrelated concepts of predictive inconsistency. The first, known as predictive multiplicity, occurs when models that perform similarly produce conflicting predictions for individual samples. The second concept, predictive churn, examines the differences in individual predictions before and after model updates.
arXiv Detail & Related papers (2024-02-12T16:15:25Z)
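Predictive churn itself has a direct definition, the fraction of samples whose predicted label flips across a model update, which fits in a few lines:

```python
# Churn between a model and its update, computed over a fixed evaluation set.
import numpy as np

def churn(preds_old: np.ndarray, preds_new: np.ndarray) -> float:
    """Fraction of samples where the two models' predicted labels disagree."""
    return float(np.mean(preds_old != preds_new))
```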
- JAB: Joint Adversarial Prompting and Belief Augmentation [81.39548637776365]
We introduce a joint framework in which we probe and improve the robustness of a black-box target model via adversarial prompting and belief augmentation.
This framework utilizes an automated red teaming approach to probe the target model, along with a belief augmenter to generate instructions for the target model to improve its robustness to those adversarial probes.
arXiv Detail & Related papers (2023-11-16T00:35:54Z)
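The probe-and-harden loop above can be sketched schematically; every callable here (`red_team`, `target`, `belief_augment`, `is_unsafe`) is hypothetical, standing in for components the abstract only names.

```python
# Purely schematic red-teaming loop in the spirit of the JAB description.
def jab_round(red_team, target, belief_augment, is_unsafe, seed_prompts, beliefs):
    for prompt in seed_prompts:
        attack = red_team(prompt, beliefs)      # adversarial probe of the target
        reply = target(attack, beliefs)         # target answers under current beliefs
        if is_unsafe(reply):
            # Belief augmenter turns the failure into a defensive instruction.
            beliefs.append(belief_augment(attack, reply))
    return beliefs                              # hardened beliefs for the next round
```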
- Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe into prompts that evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z)