ComFe: Interpretable Image Classifiers With Foundation Models, Transformers and Component Features
- URL: http://arxiv.org/abs/2403.04125v3
- Date: Fri, 24 May 2024 06:10:35 GMT
- Title: ComFe: Interpretable Image Classifiers With Foundation Models, Transformers and Component Features
- Authors: Evelyn Mannix, Howard Bondell,
- Abstract summary: Component Features (ComFe) is a novel interpretable-by-design image classification approach.
It is highly scalable and can obtain better accuracy and robustness in comparison to non-interpretable methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpretable computer vision models are able to explain their reasoning through comparing the distances between the image patch embeddings and prototypes within a latent space. However, many of these approaches introduce additional complexity, can require multiple training steps and often have a performance cost in comparison to black-box approaches. In this work, we introduce Component Features (ComFe), a novel interpretable-by-design image classification approach that is highly scalable and can obtain better accuracy and robustness in comparison to non-interpretable methods. Inspired by recent developments in computer vision foundation models, ComFe uses a transformer-decoder head and a hierarchical mixture-modelling approach with a foundation model backbone to obtain higher accuracy compared to previous interpretable models across a range of fine-grained vision benchmarks, without the need to individually tune hyper-parameters for each dataset. With only global image labels and no segmentation or part annotations, ComFe can identify consistent component features within an image and determine which of these features are informative in making a prediction.
Related papers
- TCSAFormer: Efficient Vision Transformer with Token Compression and Sparse Attention for Medical Image Segmentation [1.474723404975345]
We propose an efficient medical image segmentation network, named TCSAFormer.<n>It incorporates a Compressed Attention (CA) module, which combines token compression and pixel-level sparse attention to dynamically focus on the most relevant key-value pairs for each query.<n>It introduces a Dual-Branch Feed-Forward Network (DBFFN) module as a replacement for the standard FFN to capture local contextual features and multiscale information.
arXiv Detail & Related papers (2025-08-06T03:38:07Z) - FCC: Fully Connected Correlation for Few-Shot Segmentation [11.277022867553658]
Few-shot segmentation (FSS) aims to segment the target object in a query image using only a small set of support images and masks.
Previous methods have tried to obtain prior information by creating correlation maps from pixel-level correlation on final-layer or same-layer features.
We introduce FCC (Fully Connected Correlation) to integrate pixel-level correlations between support and query features.
arXiv Detail & Related papers (2024-11-18T03:32:02Z) - Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
This paper explores the differences across various CLIP-trained vision backbones.
Method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone.
arXiv Detail & Related papers (2024-05-27T12:59:35Z) - Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning, that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian recognition (PAR) algorithms are mainly developed based on a static image.
We propose to understand human attributes using video frames that can fully use temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z) - Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion [23.62010759076202]
We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy.
arXiv Detail & Related papers (2023-12-17T11:59:14Z) - With a Little Help from your own Past: Prototypical Memory Networks for
Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z) - Prefix Conditioning Unifies Language and Label Supervision [84.11127588805138]
We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves the performance in zero-shot image recognition accuracy and robustness to the image-level distribution shift.
arXiv Detail & Related papers (2022-06-02T16:12:26Z) - Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios.
arXiv Detail & Related papers (2022-04-12T15:03:51Z) - Palette: Image-to-Image Diffusion Models [50.268441533631176]
We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models.
On four challenging image-to-image translation tasks, Palette outperforms strong GAN and regression baselines.
We report several sample quality scores including FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against reference images.
arXiv Detail & Related papers (2021-11-10T17:49:29Z) - Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing.
arXiv Detail & Related papers (2021-06-08T06:13:11Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - So-ViT: Mind Visual Tokens for Vision Transformer [27.243241133304785]
We propose a new classification paradigm, where the second-order, cross-covariance pooling of visual tokens is combined with class token for final classification.
We develop a light-weight, hierarchical module based on off-the-shelf convolutions for visual token embedding.
The results show our models, when trained from scratch, outperform the competing ViT variants, while being on par with or better than state-of-the-art CNN models.
arXiv Detail & Related papers (2021-04-22T09:05:09Z) - Match Them Up: Visually Explainable Few-shot Image Classification [27.867833878756553]
Few-shot learning is usually based on an assumption that the pre-trained knowledge can be obtained from base (seen) categories and can be well transferred to novel (unseen) categories.
In this paper, we reveal a new way to perform FSL for image classification, using visual representations from the backbone model and weights generated by a newly-emerged explainable classifier.
Experimental results prove that the proposed method can achieve both good accuracy and satisfactory explainability on three mainstream datasets.
arXiv Detail & Related papers (2020-11-25T05:47:35Z) - FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning [64.32306537419498]
We propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations.
These transformations also use information from both within-class and across-class representations that we extract through clustering.
We demonstrate that our method is comparable to current state of art for smaller datasets while being able to scale up to larger datasets.
arXiv Detail & Related papers (2020-07-16T17:55:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.