Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition
- URL: http://arxiv.org/abs/2204.10731v1
- Date: Fri, 22 Apr 2022 14:38:40 GMT
- Title: Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition
- Authors: Yunqing Hu, Xuan Jin, Yin Zhang, Haiwen Hong, Jingfeng Zhang, Feihu Yan, Yuan He, Hui Xue
- Abstract summary: Vision Transformer (ViT) is the research base for this paper.
Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images.
We propose a weakly supervised object localization-based approach to extract multi-scale local features.
- Score: 24.406654146411682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous works on multi-label image recognition (MLIR) usually take CNNs as a
starting point for research. In this paper, we instead take the pure Vision Transformer
(ViT) as the research base, making full use of the Transformer's long-range dependency
modeling to circumvent the limited local receptive fields of CNNs. However, for
multi-label images containing multiple objects of different categories, scales, and
spatial relations, global information alone is not optimal. Our goal is to leverage
ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label
images, a process we name diverse instance discovery (DiD). To this end, we propose a
semantic category-aware module and a spatial relationship-aware module, and combine the
two through a re-constraint strategy to obtain instance-aware attention maps. Finally,
we propose a weakly supervised object localization-based approach that extracts
multi-scale local features to form a multi-view pipeline. Our method requires only weak
supervision at the label level; no additional knowledge injection or other strongly
supervised information is required. Experiments on three benchmark datasets show that
our method significantly outperforms previous works and achieves state-of-the-art
results under fair experimental comparisons.
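To make the mechanism concrete, here is a minimal, hypothetical PyTorch sketch of the general idea (not the authors' implementation; the function names, head averaging, and 0.5 threshold are our assumptions): a CLS-to-patch attention map is read off the last ViT block and used to mask-pool patch tokens, standing in for DiD's instance-aware attention maps and local-feature extraction.

```python
# Hypothetical sketch, not the authors' code: mine an instance cue from a
# ViT's CLS-to-patch self-attention and pool patch tokens over it.
import torch

def cls_attention_map(attn, grid_hw):
    """attn: [B, heads, 1+N, 1+N] self-attention from the last ViT block,
    assuming the CLS token sits at index 0 and N == H*W patches.
    Returns a [B, H, W] map of CLS attention, normalized to [0, 1]."""
    h, w = grid_hw
    cls_to_patch = attn[:, :, 0, 1:].mean(dim=1)             # average heads: [B, N]
    amap = cls_to_patch.reshape(-1, h, w)
    return amap / amap.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)

def pool_instance_features(tokens, amap, thresh=0.5):
    """tokens: [B, N, D] patch tokens; amap: [B, H, W] attention map.
    Masked average pooling over high-attention patches (a rough instance cue)."""
    b, n, _ = tokens.shape
    mask = (amap.reshape(b, n) > thresh).float()             # [B, N]
    denom = mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return (tokens * mask.unsqueeze(-1)).sum(dim=1) / denom  # [B, D]
```

In the paper itself, the attention maps are additionally shaped by the semantic category-aware and spatial relationship-aware modules and a re-constraint strategy before local features are extracted; this sketch only shows the raw attention route.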
Related papers
- VLMine: Long-Tail Data Mining with Vision Language Models [18.412533708652102]
This work focuses on the problem of identifying rare examples within a corpus of unlabeled data.
We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM).
Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques.
arXiv Detail & Related papers (2024-09-23T19:13:51Z)
- HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification [15.129037250680582]
Tight visual-linguistic interactions play a vital role in improving classification performance.
Recent Transformer-based methods have achieved great success in multi-label image classification.
We propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs.
arXiv Detail & Related papers (2024-07-23T07:31:42Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Object-Aware Self-supervised Multi-Label Learning [9.496981642855769]
We propose an Object-Aware Self-Supervision (OASS) method to obtain more fine-grained representations for multi-label learning.
The proposed method can be leveraged to efficiently generate Class-Specific Instances (CSI) in a proposal-free fashion.
Experiments on the VOC2012 dataset for multi-label classification demonstrate the effectiveness of the proposed method against the state-of-the-art counterparts.
arXiv Detail & Related papers (2022-05-14T10:14:08Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction; a minimal sketch of this second-order pooling appears after the list.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z) - MlTr: Multi-label Classification with Transformer [35.14232810099418]
We propose a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention.
The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE.
arXiv Detail & Related papers (2021-06-11T06:53:09Z) - Semi-Supervised Domain Adaptation with Prototypical Alignment and
Consistency Learning [86.6929930921905]
This paper studies how much having a few labeled target samples can further help address domain shift.
To explore the full potential of landmarks, we incorporate a prototypical alignment (PA) module which calculates a target prototype for each class from the landmarks; a minimal prototype computation is sketched after the list.
Specifically, we severely perturb the labeled images, making PA non-trivial to achieve and thus promoting model generalizability.
arXiv Detail & Related papers (2021-04-19T08:46:08Z) - A Universal Representation Transformer Layer for Few-Shot Image
Classification [43.31379752656756]
Few-shot classification aims to recognize unseen classes when presented with only a small number of samples.
We consider the problem of multi-domain few-shot image classification, where unseen classes and examples come from diverse data sources.
Here, we propose a Universal Representation Transformer layer that meta-learns to leverage universal features for few-shot classification.
arXiv Detail & Related papers (2020-06-21T03:08:00Z) - Improving Few-shot Learning by Spatially-aware Matching and
CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
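As a worked example from the list above, the Multi-level Second-order entry builds on power-normalized second-order feature streams. Below is a minimal sketch of that generic operation (our own illustration, not the paper's code; the 0.5 exponent is an assumption):

```python
# Generic power-normalized second-order pooling (illustrative sketch).
import torch

def power_normalized_second_order(features, alpha=0.5, eps=1e-6):
    """features: [N, D] local descriptors for one image or episode.
    Returns a [D*D] vector: the descriptors' autocorrelation matrix with
    an element-wise signed power normalization applied."""
    m = features.t() @ features / features.shape[0]   # [D, D] second-order stats
    m = torch.sign(m) * (m.abs() + eps) ** alpha      # power normalization
    return m.flatten()
```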
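Similarly, for the prototypical alignment entry, a target prototype per class can be read as a mean over landmark embeddings. A minimal sketch under that reading (names and shapes are our assumptions):

```python
# Illustrative per-class prototype computation for prototypical alignment.
import torch

def class_prototypes(features, labels, num_classes):
    """features: [N, D] embeddings of labeled target samples (landmarks);
    labels: [N] integer class ids. Returns [C, D] mean prototypes."""
    protos = torch.zeros(num_classes, features.shape[1], device=features.device)
    counts = torch.zeros(num_classes, device=features.device)
    protos.index_add_(0, labels, features)
    counts.index_add_(0, labels, torch.ones(labels.shape[0], device=features.device))
    return protos / counts.clamp(min=1.0).unsqueeze(1)
```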
This list is automatically generated from the titles and abstracts of the papers on this site.