ViTOL: Vision Transformer for Weakly Supervised Object Localization
- URL: http://arxiv.org/abs/2204.06772v1
- Date: Thu, 14 Apr 2022 06:16:34 GMT
- Authors: Saurav Gupta, Sourav Lakhotia, Abhay Rawat, Rahul Tallamraju
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly supervised object localization (WSOL) aims at predicting object locations in an image using only image-level category labels. Common challenges that image classification models encounter when localizing objects are: (a) they tend to look at the most discriminative features in an image, which confines the localization map to a very small region; (b) the localization maps are class agnostic, and the models highlight objects of multiple classes in the same image; and (c) the localization performance is affected by background noise. To alleviate the above challenges, we introduce the following simple changes through our proposed method, ViTOL. We leverage the vision-based transformer for self-attention and introduce a patch-based attention dropout layer (p-ADL) to increase the coverage of the localization map, and a gradient attention rollout mechanism to generate class-dependent attention maps. We conduct extensive quantitative, qualitative and ablation experiments on the ImageNet-1K and CUB datasets. We achieve state-of-the-art MaxBoxAcc-V2 localization scores of 70.47% and 73.17% on the two datasets, respectively. Code is available at https://github.com/Saurav-31/ViTOL
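
To make the two proposed components concrete, below is a minimal PyTorch-style sketch of a patch-based attention dropout layer and a gradient attention rollout, assuming a standard ViT with a CLS token. The names `PatchADL` and `gradient_attention_rollout` and the hyperparameters `drop_rate` and `gamma` are illustrative choices for this sketch, not taken from the authors' repository.

```python
import torch
import torch.nn as nn

class PatchADL(nn.Module):
    """Patch-based attention dropout (illustrative sketch, not the authors' code).

    During training, with probability drop_rate, the most salient patch tokens
    (mean activation above gamma * max) are masked out, forcing the model to
    spread attention beyond the most discriminative region; otherwise tokens
    are re-weighted by a soft importance map.
    """

    def __init__(self, drop_rate: float = 0.75, gamma: float = 0.9):
        super().__init__()
        self.drop_rate = drop_rate
        self.gamma = gamma

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch embeddings
        importance = tokens.mean(dim=-1, keepdim=True)                 # (B, N, 1)
        if self.training and torch.rand(1).item() < self.drop_rate:
            threshold = self.gamma * importance.amax(dim=1, keepdim=True)
            drop_mask = (importance < threshold).float()               # zero out top patches
            return tokens * drop_mask
        return tokens * torch.sigmoid(importance)                      # soft re-weighting

def gradient_attention_rollout(attns, grads):
    """Roll out gradient-weighted attention across layers.

    attns, grads: per-layer (B, heads, N, N) attention maps and their
    gradients w.r.t. the target class logit. Weighting attention by its
    positive gradient makes the rolled-out map class-dependent.
    """
    B, _, N, _ = attns[0].shape
    identity = torch.eye(N, device=attns[0].device)
    rollout = identity.expand(B, N, N).clone()
    for attn, grad in zip(attns, grads):
        weighted = (attn * grad.clamp(min=0)).mean(dim=1)              # average heads: (B, N, N)
        weighted = weighted + identity                                 # account for residual paths
        weighted = weighted / weighted.sum(dim=-1, keepdim=True)       # re-normalize rows
        rollout = weighted @ rollout                                   # accumulate across layers
    return rollout[:, 0, 1:]  # CLS row over patch tokens -> (B, N-1) localization scores
```

Reshaping the returned (B, N-1) scores to the patch grid and upsampling to the image size yields a class-dependent localization map of the kind the abstract describes.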
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged in which a foreground prediction map is generated to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- Rethinking the Localization in Weakly Supervised Object Localization [51.29084037301646]
Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision.
Recently, dividing WSOL into two parts (class-agnostic object localization and object classification) has become the state-of-the-art pipeline for this task.
We propose to replace SCR with a binary-class detector (BCD) for localizing multiple objects, where the detector is trained by discriminating the foreground and background.
arXiv Detail & Related papers (2023-08-11T14:38:51Z)
- Spatial-Aware Token for Weakly Supervised Object Localization [137.0570026552845]
We propose a task-specific spatial-aware token to condition localization in a weakly supervised manner.
Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively.
arXiv Detail & Related papers (2023-03-18T15:38:17Z)
- Constrained Sampling for Class-Agnostic Weakly Supervised Object Localization [10.542859578763068]
Self-supervised vision transformers can generate accurate localization maps of the objects in an image.
We propose leveraging the multiple maps generated by the different transformer heads to acquire pseudo-labels for training a weakly-supervised object localization model.
arXiv Detail & Related papers (2022-09-09T19:58:38Z)
- Discriminative Sampling of Proposals in Self-Supervised Transformers for Weakly Supervised Object Localization [10.542859578763068]
Self-supervised vision transformers can generate accurate localization maps of the objects in an image.
We propose leveraging the multiple maps generated by the different transformer heads to acquire pseudo-labels for training a weakly-supervised object localization model.
arXiv Detail & Related papers (2022-09-09T18:33:23Z)
- Re-Attention Transformer for Weakly Supervised Object Localization [45.417606565085116]
We present a re-attention mechanism termed token refinement transformer (TRT) that captures the object-level semantics to guide the localization well.
Specifically, TRT introduces a novel module named token priority scoring module (TPSM) to suppress the effects of background noise while focusing on the target object.
arXiv Detail & Related papers (2022-08-03T04:34:28Z)
- Inter-Image Communication for Weakly Supervised Localization [77.2171924626778]
Weakly supervised localization aims at finding target object regions using only image-level supervision.
We propose to leverage pixel-level similarities across different objects for learning more accurate object locations.
Our method achieves a Top-1 localization error rate of 45.17% on the ILSVRC validation set.
arXiv Detail & Related papers (2020-08-12T04:14:11Z)
- Rethinking the Route Towards Weakly Supervised Object Localization [28.90792512056726]
We show that weakly supervised object localization should be divided into two parts: class-agnostic object localization and object classification.
For class-agnostic object localization, we should use class-agnostic methods to generate noisy pseudo annotations and then perform bounding box regression on them without class labels.
Our PSOL models have good transferability across different datasets without fine-tuning.
arXiv Detail & Related papers (2020-02-26T08:54:20Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.