Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
- URL: http://arxiv.org/abs/2601.21944v1
- Date: Thu, 29 Jan 2026 16:28:55 GMT
- Title: Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models
- Authors: Konstantinos P. Panousis, Diego Marcos,
- Abstract summary: Vision-Language Models (VLMs) are often treated as black boxes, with limited or non-existent investigation of their decision-making process. We introduce the notion of clarity, a measure capturing the interplay between downstream performance and the sparsity and precision of the concept representation. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels.
- Score: 12.322360020814516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread adoption of Vision-Language Models (VLMs) across fields has amplified concerns about model interpretability. Distressingly, these models are often treated as black boxes, with limited or non-existent investigation of their decision-making process. Despite numerous post- and ante-hoc interpretability methods, systematic and objective evaluation of the learned representations remains limited, particularly for sparsity-aware methods that are increasingly considered to "induce interpretability". In this work, we focus on Concept Bottleneck Models and investigate how different modeling decisions affect the emerging representations. We introduce the notion of clarity, a measure capturing the interplay between downstream performance and the sparsity and precision of the concept representation, while proposing an interpretability assessment framework using datasets with ground-truth concept annotations. We consider both VLM- and attribute-predictor-based CBMs, and three different sparsity-inducing strategies: per-example $\ell_1$, $\ell_0$, and Bernoulli-based formulations. Our experiments reveal a critical trade-off between flexibility and interpretability, under which a given method can exhibit markedly different behaviors even at comparable performance levels. The code will be made publicly available upon publication.
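Since the abstract is terse about mechanics, a minimal sketch may help readers picture the three sparsity strategies it names. The module below is illustrative only, not the authors' code: a logistic-noise gate stands in for the Bernoulli formulation and a straight-through gate for $\ell_0$; all names, shapes, and hyperparameters are assumptions.

```python
# Hedged sketch (not the paper's code): a concept bottleneck with the three
# per-example sparsity strategies named in the abstract.
import torch
import torch.nn as nn

class SparseConceptBottleneck(nn.Module):
    def __init__(self, feat_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        self.concept_head = nn.Linear(feat_dim, n_concepts)  # concept scores
        self.gate_head = nn.Linear(feat_dim, n_concepts)     # per-example gate logits
        self.classifier = nn.Linear(n_concepts, n_classes)   # label from concepts only

    def forward(self, x, mode="l1", temperature=0.5):
        c = torch.sigmoid(self.concept_head(x))              # activations in [0, 1]
        if mode == "l1":                                      # per-example l1 penalty
            return self.classifier(c), c, c.abs().sum(dim=-1).mean()
        gate_logits = self.gate_head(x)
        probs = torch.sigmoid(gate_logits)
        if mode == "bernoulli":                               # relaxed Bernoulli gates
            u = torch.rand_like(probs).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((gate_logits + torch.log(u) - torch.log(1 - u)) / temperature)
        else:                                                 # "l0": straight-through hard gates
            s = (probs > 0.5).float() + probs - probs.detach()
        z = c * s                                             # sparsified concept vector
        return self.classifier(z), z, probs.sum(dim=-1).mean()  # expected active gates

model = SparseConceptBottleneck(512, 64, 10)
logits, concepts, penalty = model(torch.randn(8, 512), mode="bernoulli")
print(logits.shape, penalty.item())
```

Given ground-truth concept annotations, a clarity-style diagnostic would then relate downstream accuracy to how few, and how correct, the surviving concept activations are; the paper's exact definition of clarity is not reproduced here.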
Related papers
- GlassMol: Interpretable Molecular Property Prediction with Concept Bottleneck Models [26.551184488481912]
In drug discovery, where safety is critical, machine learning models operate as black boxes. Existing interpretability methods suffer from the effectiveness-trustworthiness trade-off. We introduce GlassMol, a model-agnostic CBM that addresses these gaps through automated concept curation and LLM-guided concept selection.
arXiv Detail & Related papers (2026-03-01T21:07:49Z)
- Enhancing Interpretability for Vision Models via Shapley Value Optimization [10.809438356590988]
Self-explaining neural networks sacrifice performance and compatibility due to their specialized architectural designs. We propose a novel self-explaining framework that integrates Shapley value estimation as an auxiliary task during training. Our method achieves state-of-the-art interpretability.
arXiv Detail & Related papers (2025-12-16T12:33:04Z)
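For readers unfamiliar with the auxiliary task this paper builds on, below is a generic permutation-sampling Monte Carlo Shapley estimator. It is a textbook baseline, not the paper's training-time estimator; `model` and `baseline` are illustrative stand-ins.

```python
# Hedged sketch: permutation-sampling Monte Carlo estimate of Shapley values.
import random

def shapley_estimate(model, x, baseline, n_features, n_samples=200):
    """Approximate each feature's Shapley value for model(x)."""
    values = [0.0] * n_features
    for _ in range(n_samples):
        order = list(range(n_features))
        random.shuffle(order)                  # random feature ordering
        current = list(baseline)               # start from the baseline input
        prev_score = model(current)
        for i in order:
            current[i] = x[i]                  # add feature i to the coalition
            score = model(current)
            values[i] += (score - prev_score) / n_samples  # running mean of marginals
            prev_score = score
    return values

# For a linear model the estimate is exact: each value equals w_i * x_i here.
w = [0.5, -1.0, 2.0]
linear = lambda v: sum(wi * vi for wi, vi in zip(w, v))
print(shapley_estimate(linear, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0], n_features=3))
```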
- LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z)
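To make "drawing through dot matrices" concrete, a toy scorer might parse the model's textual grid and compare it with a reference drawing. The `#`/`.` encoding and the IoU metric below are assumptions for illustration, not LTD-Bench's actual protocol.

```python
# Illustrative only: parse a dot-matrix drawing and score it against a target.
def parse_dot_matrix(text):
    """'#' marks a filled cell; anything else is empty."""
    return [[ch == "#" for ch in line] for line in text.strip().splitlines()]

def iou(pred, target):
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p and t
            union += p or t
    return inter / union if union else 1.0

drawn = "..#..\n.###.\n#####"   # hypothetical model output (a triangle)
ideal = "..#..\n.###.\n#####"   # reference drawing
print(iou(parse_dot_matrix(drawn), parse_dot_matrix(ideal)))  # 1.0
```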
- Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model [5.700536552863068]
Concept Bottleneck Models (CBMs) enhance interpretability by predicting human-understandable concepts as intermediate representations. We propose a lightweight Disentangled Concept Bottleneck Model (LDCBM) that automatically groups visual features into semantically meaningful components. Experiments on three diverse datasets demonstrate that LDCBM achieves higher concept and class accuracy, outperforming previous CBMs in both interpretability and classification performance.
arXiv Detail & Related papers (2025-10-17T15:59:30Z)
- Interpretable Reward Modeling with Active Concept Bottlenecks [54.00085739303773]
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. We formalize an active learning strategy that dynamically acquires the most informative concept labels.
arXiv Detail & Related papers (2025-07-07T06:26:04Z)
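The abstract does not spell out CB-RM's acquisition function, so the sketch below uses maximum predictive entropy, a common stand-in, to pick the next (example, concept) pair to send to an annotator; it illustrates only the selection step of the active-learning loop.

```python
# Hedged sketch of active concept-label acquisition via maximum entropy.
import math

def entropy(p):
    p = min(max(p, 1e-9), 1 - 1e-9)            # avoid log(0)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def next_query(concept_probs):
    """concept_probs[i][j]: predicted probability that concept j holds for example i."""
    best, best_h = None, -1.0
    for i, row in enumerate(concept_probs):
        for j, p in enumerate(row):
            h = entropy(p)
            if h > best_h:
                best, best_h = (i, j), h
    return best                                 # most uncertain (example, concept) pair

print(next_query([[0.95, 0.50], [0.10, 0.70]]))  # -> (0, 1): the 0.50 entry
```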
- Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to make their visual recognition processes more interpretable. We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment. Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
- Investigating the Duality of Interpretability and Explainability in Machine Learning [2.8311451575532156]
Complex "black box" models exhibit exceptional predictive performance. Their inherently opaque nature raises concerns about transparency and interpretability. Efforts are focused on explaining these models instead of developing ones that are inherently interpretable.
arXiv Detail & Related papers (2025-03-27T10:48:40Z)
- Concept Layers: Enhancing Interpretability and Intervenability via LLM Conceptualization [2.163881720692685]
We introduce a new methodology for incorporating interpretability and intervenability into an existing model by integrating Concept Layers (CLs) into its architecture. Our approach projects the model's internal vector representations into a conceptual, explainable vector space before reconstructing and feeding them back into the model. We evaluate CLs across multiple tasks, demonstrating that they maintain the original model's performance and agreement while enabling meaningful interventions.
arXiv Detail & Related papers (2025-02-19T11:10:19Z)
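One plausible reading of "projecting internal representations into a conceptual space before reconstructing them" is a least-squares projection onto a small set of concept directions, as sketched below; the concept basis and the solver are assumptions, not the paper's formulation.

```python
# Hedged sketch: project a hidden state onto concept directions, then rebuild it.
import torch

def concept_layer(h, C):
    """h: (d,) hidden state; C: (k, d) matrix of k concept directions."""
    gram = C @ C.T                              # (k, k); assumes directions are independent
    a = torch.linalg.solve(gram, C @ h)         # least-squares concept coefficients
    h_rec = C.T @ a                             # reconstruction fed back into the model
    return a, h_rec

torch.manual_seed(0)
C = torch.randn(4, 16)       # 4 hypothetical concept directions in a 16-d space
h = torch.randn(16)
scores, h_rec = concept_layer(h, C)
print(scores)                # editing these coefficients is the intervention hook
```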
- Linearly-Interpretable Concept Embedding Models for Text Analysis [9.340843984411137]
We propose a novel Linearly Interpretable Concept Embedding Model (LICEM). LICEM's classification accuracy is better than existing interpretable models and matches black-box ones. We show that the explanations provided by our models are more intervenable and causally consistent with respect to existing solutions.
arXiv Detail & Related papers (2024-06-20T14:04:53Z)
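The "linearly interpretable" part can be pictured as a class score that is an explicit weighted sum of concept activations, so the weights double as the explanation. The concept names and values below are invented for illustration; LICEM itself learns how concepts and weights are produced.

```python
# Hedged sketch: a prediction that is linear in (named) concept activations.
concepts = {"has_positive_tone": 0.9, "mentions_price": 0.2, "is_question": 0.0}
weights  = {"has_positive_tone": 2.1, "mentions_price": -0.7, "is_question": 0.3}

score = sum(weights[c] * a for c, a in concepts.items())          # class score
ranked = sorted(concepts, key=lambda c: abs(weights[c] * concepts[c]), reverse=True)
print(f"score={score:.2f}, most influential concept: {ranked[0]}")
```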
- Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
- Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
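A rough sketch of the attribute-informed idea: hold a conditional generator's content code fixed and optimize its attribute input until a classifier assigns the desired label. The `generator`/`classifier` placeholders and the plain gradient loop are assumptions for illustration, not AIP's actual components.

```python
# Hedged sketch: attribute-conditioned counterfactual search by gradient descent.
import torch
import torch.nn.functional as F

def counterfactual(generator, classifier, z, attrs, target, steps=200, lr=0.1):
    attrs = attrs.clone().requires_grad_(True)        # perturb attributes only
    opt = torch.optim.Adam([attrs], lr=lr)
    for _ in range(steps):
        x = generator(z, attrs)                       # attribute-conditioned sample
        loss = F.cross_entropy(classifier(x).unsqueeze(0), torch.tensor([target]))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z, attrs.detach()), attrs.detach()

# Toy stand-ins so the sketch runs end to end.
gen = lambda z, a: z + a                              # "generator": shift by attributes
clf = torch.nn.Linear(4, 3)                           # "classifier" over 3 labels
x_cf, a_cf = counterfactual(gen, clf, torch.zeros(4), torch.zeros(4), target=2)
print(clf(x_cf).argmax().item())                      # usually prints 2, the desired label
```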
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.