TransHP: Image Classification with Hierarchical Prompting
- URL: http://arxiv.org/abs/2304.06385v5
- Date: Wed, 20 Dec 2023 01:28:57 GMT
- Title: TransHP: Image Classification with Hierarchical Prompting
- Authors: Wenhao Wang, Yifan Sun, Wei Li, Yi Yang
- Abstract summary: This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task.
We believe this closely imitates human visual recognition: humans may use the ancestor class as a prompt to focus on the subtle differences among descendant classes.
- Score: 27.049504972041834
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper explores a hierarchical prompting mechanism for the hierarchical
image classification (HIC) task. Different from prior HIC methods, our
hierarchical prompting is the first to explicitly inject ancestor-class
information as a tokenized hint that benefits the descendant-class
discrimination. We believe this closely imitates human visual recognition:
humans may use the ancestor class as a prompt to focus on the subtle
differences among descendant classes. We model this prompting mechanism as a
Transformer with Hierarchical Prompting (TransHP). TransHP consists of three
steps: 1) learning a set of prompt tokens to represent the coarse (ancestor)
classes, 2) on-the-fly predicting the coarse class of the input image at an
intermediate block, and 3) injecting the prompt token of the predicted coarse
class into the intermediate feature. Though the parameters of TransHP remain
the same for all input images, the injected coarse-class prompt conditions
(modifies) the subsequent feature extraction and encourages a dynamic focus on
relatively subtle differences among the descendant classes. Extensive
experiments show that TransHP improves image classification in accuracy (e.g.,
improving ViT-B/16 by +2.83% in ImageNet classification accuracy), training data
efficiency (e.g., a +12.69% improvement with 10% of the ImageNet training data), and
model explainability. Moreover, TransHP also performs favorably against prior
HIC methods, showing that TransHP exploits the hierarchical information well.
The code is available at: https://github.com/WangWenhao0716/TransHP.
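The three steps above are concrete enough to sketch. Below is a minimal PyTorch-style illustration of a single hierarchical prompting block, assuming a ViT backbone whose first token is the class token; module and variable names are illustrative rather than taken from the official repository, and the exact injection strategy may differ from the paper's.

```python
import torch
import torch.nn as nn

class HierarchicalPromptBlock(nn.Module):
    """Illustrative sketch of TransHP's three prompting steps
    (hypothetical names; not the official implementation)."""

    def __init__(self, dim: int, num_coarse_classes: int):
        super().__init__()
        # Step 1: one learnable prompt token per coarse (ancestor) class.
        self.prompts = nn.Parameter(0.02 * torch.randn(num_coarse_classes, dim))
        # Small head used for the on-the-fly coarse prediction.
        self.coarse_head = nn.Linear(dim, num_coarse_classes)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, dim); token 0 is assumed to be [CLS].
        coarse_logits = self.coarse_head(tokens[:, 0])        # Step 2
        coarse_pred = coarse_logits.argmax(dim=-1)            # (batch,)
        prompt = self.prompts[coarse_pred].unsqueeze(1)       # (batch, 1, dim)
        # Step 3: inject the selected prompt so that all subsequent
        # Transformer blocks are conditioned on the coarse class.
        tokens = torch.cat([tokens, prompt], dim=1)
        return tokens, coarse_logits  # coarse_logits gets a coarse-label loss
```

The hard argmax shown here is the simplest reading of step 2; during training the coarse logits would be supervised with ancestor-class labels, and a probability-weighted mixture of prompt tokens would be a natural differentiable alternative to the hard selection.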
Related papers
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image (a rough sketch of this pipeline appears below).
arXiv Detail & Related papers (2023-07-10T03:06:45Z) - Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models [52.3032592038514]
- Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models [52.3032592038514]
We propose a class-aware text prompt to enrich generated prompts with label-related image information.
We achieve an average improvement of 4.03% on new classes and 3.19% on the harmonic mean over eleven classification benchmarks.
arXiv Detail & Related papers (2023-03-30T06:02:40Z) - Data Augmentation Vision Transformer for Fine-grained Image
Classification [1.6211899643913996]
We propose a data augmentation vision transformer (DAVT) for fine-grained image classification.
We also propose a hierarchical attention selection (HAS) method, which improves the selection of discriminative markers across levels of learning.
Experimental results show that the accuracy of this method on two common datasets, CUB-200-2011 and Stanford Dogs, is better than that of existing mainstream methods.
arXiv Detail & Related papers (2022-11-23T11:34:11Z) - Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images (a rough sketch of one combined training step appears below).
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z) - iCAR: Bridging Image Classification and Image-text Alignment for Visual
- iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition [33.2800417526215]
Image classification, which classifies images by pre-defined categories, has been the dominant approach to visual representation learning over the last decade.
Visual learning through image-text alignment, however, has emerged and shows promising performance, especially for zero-shot recognition.
We propose a deep fusion method with three adaptations that effectively bridge the two learning tasks.
arXiv Detail & Related papers (2022-04-22T15:27:21Z) - SGNet: A Super-class Guided Network for Image Classification and Object
Detection [15.853822797338655]
The paper proposes a super-class guided network (SGNet) to integrate high-level semantic information into the network.
The experimental results validate the proposed approach and demonstrate its superior performance on image classification and object detection.
arXiv Detail & Related papers (2021-04-26T22:26:12Z) - Isometric Propagation Network for Generalized Zero-shot Learning [72.02404519815663]
A popular strategy is to learn a mapping between the semantic space of class attributes and the visual space of images based on the seen classes and their data.
We propose the Isometric Propagation Network (IPN), which learns to strengthen the relation between classes within each space and align the class dependency across the two spaces.
IPN achieves state-of-the-art performance on three popular zero-shot learning benchmarks.
arXiv Detail & Related papers (2021-02-03T12:45:38Z) - Attribute Propagation Network for Graph Zero-shot Learning [57.68486382473194]
We introduce the attribute propagation network (APNet), which is composed of 1) a graph propagation model generating an attribute vector for each class and 2) a parameterized nearest neighbor (NN) classifier.
APNet achieves either compelling performance or new state-of-the-art results in experiments with two zero-shot learning settings and five benchmark datasets.
arXiv Detail & Related papers (2020-09-24T16:53:40Z) - Zero-Shot Recognition through Image-Guided Semantic Classification [9.291055558504588]
We present a new embedding-based framework for zero-shot learning (ZSL).
Motivated by the binary relevance method for multi-label classification, we propose to inversely learn the mapping between an image and a semantic classifier.
IGSC is conceptually simple and can be realized by a slight enhancement of an existing deep architecture for classification; a rough sketch of the inverse mapping appears below.
arXiv Detail & Related papers (2020-07-23T06:22:40Z) - SCAN: Learning to Classify Images without Labels [73.69513783788622]
- SCAN: Learning to Classify Images without Labels [73.69513783788622]
We advocate a two-step approach where feature learning and clustering are decoupled.
A self-supervised task from representation learning is employed to obtain semantically meaningful features.
We obtain promising results on ImageNet and outperform several semi-supervised learning methods in the low-data regime; a short sketch of the two-step recipe appears below.
arXiv Detail & Related papers (2020-05-25T18:12:33Z)