Related papers: ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers

ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers

URL: http://arxiv.org/abs/2501.09221v2
Date: Mon, 03 Feb 2025 19:53:49 GMT
Title: ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers
Authors: Sanchit Sinha, Guangzhi Xiong, Aidong Zhang,
Abstract summary: ASCENT-ViT is an attention-based, concept learning framework for Vision Transformers (ViTs)<n>It composes scale and position-aware representations from multiscale feature pyramids and ViT patch representations, respectively.<n>It can be utilized as a classification head on top of standard ViT backbones for improved predictive performance and accurate and robust concept explanations.
Score: 29.932706137805713
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Vision Transformers (ViTs) are increasingly adopted in sensitive vision applications, there is a growing demand for improved interpretability. This has led to efforts to forward-align these models with carefully annotated abstract, human-understandable semantic entities - concepts. Concepts provide global rationales to the model predictions and can be quickly understood/intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper, we propose ASCENT-ViT, an attention-based, concept learning framework that effectively composes scale and position-aware representations from multiscale feature pyramids and ViT patch representations, respectively. Further, these representations are aligned with concept annotations through attention matrices - which incorporate spatial and global (semantic) concepts. ASCENT-ViT can be utilized as a classification head on top of standard ViT backbones for improved predictive performance and accurate and robust concept explanations as demonstrated on five datasets, including three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and 2 real-world datasets (AWA2, KITS).

Related papers

Hierarchical Concept-based Interpretable Models [23.16720677779406]
Concept Embedding Models (CEMs) map inputs to human-interpretable concept representations from which tasks can be predicted.<n>Yet, CEMs fail to represent inter-concept relationships and require concept annotations at different granularities during training.<n>We introduce Hierarchical Concept Embedding Models (HiCEMs), a new family of CEMs that explicitly model concept relationships through hierarchical structures.
arXiv Detail & Related papers (2026-02-27T11:49:56Z)
Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework.<n>We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale.<n>Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
FaCT: Faithful Concept Traces for Explaining Neural Network Decisions [56.796533084868884]
Deep networks have shown remarkable performance across a wide range of tasks, yet getting a global concept-level understanding of how they function remains a key challenge.<n>We put emphasis on the faithfulness of concept-based explanations and propose a new model with model-inherent mechanistic concept-explanations.<n>Our concepts are shared across classes and, from any layer, their contribution to the logit and their input-visualization can be faithfully traced.
arXiv Detail & Related papers (2025-10-29T13:35:46Z)
Interpretable Few-Shot Image Classification via Prototypical Concept-Guided Mixture of LoRA Experts [79.18608192761512]
Self-Explainable Models (SEMs) rely on Prototypical Concept Learning (PCL) to enable their visual recognition processes more interpretable.<n>We propose a Few-Shot Prototypical Concept Classification framework that mitigates two key challenges under low-data regimes: parametric imbalance and representation misalignment.<n>Our approach consistently outperforms existing SEMs by a notable margin, with 4.2%-8.7% relative gains in 5-way 5-shot classification.
arXiv Detail & Related papers (2025-06-05T06:39:43Z)
Escaping Plato's Cave: Robust Conceptual Reasoning through Interpretable 3D Neural Object Volumes [65.63534641857476]
We introduce CAVE - Concept Aware Volumes for Explanations - a new direction that unifies interpretability and robustness in image classification. We design an inherently-interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification. In an array of quantitative metrics for interpretability, we compare against different concept-based approaches across the explainable AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.
arXiv Detail & Related papers (2025-03-17T17:55:15Z)
Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks.<n>Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z)
EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings. EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models [6.747023750015197]
This paper proposes five desiderata for explaining ViTs -- faithfulness, stability, sparsity, multi-level structure, and parsimony. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE)
arXiv Detail & Related papers (2024-06-18T14:17:57Z)
LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification [5.8754760054410955]
We introduce textttHi-CoDecomposition, a novel framework designed to enhance model interpretability through structured concept analysis. Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process.
arXiv Detail & Related papers (2024-05-29T00:36:56Z)
Advancing Ante-Hoc Explainable Models through Generative Adversarial Networks [24.45212348373868]
This paper presents a novel concept learning framework for enhancing model interpretability and performance in visual classification tasks. Our approach appends an unsupervised explanation generator to the primary classifier network and makes use of adversarial training. This work presents a significant step towards building inherently interpretable deep vision models with task-aligned concept representations.
arXiv Detail & Related papers (2024-01-09T16:16:16Z)
ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning of increasing modalities. We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
Concept-Centric Transformers: Enhancing Model Interpretability through Object-Centric Concept Learning within a Shared Global Workspace [1.6574413179773757]
Concept-Centric Transformers is a simple yet effective configuration of the shared global workspace for interpretability. We show that our model achieves better classification accuracy than all baselines across all problems.
arXiv Detail & Related papers (2023-05-25T06:37:39Z)
SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning. The proposed two modules are light-weighted and can be plugged into any transformer network and trained end-to-end easily. Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed as SUPER, to better capture the instance-specific vision-semantic characteristics. We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR) Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. Experiment results show our model considerably improves upon the state of the arts in ZSAR, reaching encouraging top-1 accuracy on UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning [139.0548263507796]
We use vision transformers (ViTs) as our base model for visual reasoning. We make better use of concepts defined as object entities and their relations to improve the reasoning ability of ViTs. We show the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA benchmarks.
arXiv Detail & Related papers (2022-04-24T02:46:43Z)
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts. In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images. We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)
Visual Concept Reasoning Networks [93.99840807973546]
A split-transform-merge strategy has been broadly used as an architectural constraint in convolutional neural networks for visual recognition tasks. We propose to exploit this strategy and combine it with our Visual Concept Reasoning Networks (VCRNet) to enable reasoning between high-level visual concepts. Our proposed model, VCRNet, consistently improves the performance by increasing the number of parameters by less than 1%.
arXiv Detail & Related papers (2020-08-26T20:02:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.