Spatial-Aware Token for Weakly Supervised Object Localization
- URL: http://arxiv.org/abs/2303.10438v2
- Date: Wed, 9 Aug 2023 07:43:25 GMT
- Title: Spatial-Aware Token for Weakly Supervised Object Localization
- Authors: Pingyu Wu, Wei Zhai, Yang Cao, Jiebo Luo, Zheng-Jun Zha
- Abstract summary: We propose a task-specific spatial-aware token to condition localization in a weakly supervised manner.
Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc.
- Score: 137.0570026552845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised object localization (WSOL) is a challenging task that aims to
localize objects with only image-level supervision. Recent works apply vision
transformers to WSOL and achieve significant success by exploiting the
long-range feature dependencies of the self-attention mechanism. However, existing
transformer-based methods reuse the classification feature maps as the
localization map, which leads to optimization conflicts between the classification
and localization tasks. To address this problem, we propose to learn a
task-specific spatial-aware token (SAT) to condition localization in a weakly
supervised manner. Specifically, a spatial token is first introduced in the
input space to aggregate representations for the localization task. Then a
spatial-aware attention module is constructed, which allows the spatial token to
generate foreground probabilities for different patches via querying and to extract
localization knowledge from the classification task. Moreover, to address the
sparse and unbalanced pixel-level supervision derived from image-level labels,
two spatial constraints, a batch area loss and a normalization loss, are
designed to compensate for and enhance this supervision. Experiments show
that the proposed SAT achieves state-of-the-art performance on both CUB-200 and
ImageNet, with 98.45% and 73.13% GT-known Loc, respectively. Even under the
extreme setting of using only 1 image per class from ImageNet for training, SAT
already exceeds the SOTA method by 2.1% GT-known Loc. Code and models are
available at https://github.com/wpy1999/SAT.
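The core mechanism described in the abstract, a spatial token that queries patch tokens to produce per-patch foreground probabilities, regularized by two spatial constraints, can be sketched roughly as below. All shapes, the sigmoid readout, and both loss forms are illustrative assumptions based only on this summary, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (assumed, not from the paper): 4 images,
# 196 patch tokens (a 14x14 grid), 64-dim embeddings.
B, N, D = 4, 196, 64
patch_tokens = rng.standard_normal((B, N, D))
spatial_token = rng.standard_normal((1, 1, D))  # learnable in practice, shared across images

# Spatial-aware attention (sketch): the spatial token queries every patch;
# a sigmoid of the scaled similarity acts as a per-patch foreground probability.
scores = spatial_token @ patch_tokens.transpose(0, 2, 1) / np.sqrt(D)  # (B, 1, N)
fg_prob = 1.0 / (1.0 + np.exp(-scores))
fg_prob = fg_prob.squeeze(axis=1)  # (B, N), values in (0, 1)

# Batch area loss (sketch): pull the mean foreground area over the batch
# toward an assumed target ratio, compensating sparse image-level supervision.
target_area = 0.35  # hypothetical hyperparameter
batch_area_loss = (fg_prob.mean() - target_area) ** 2

# Normalization loss (sketch): push probabilities toward the 0/1 extremes
# so foreground and background are well separated in the localization map.
norm_loss = (fg_prob * (1.0 - fg_prob)).mean()
```

In the actual method the spatial token is trained jointly with the classifier; the random tensors here only demonstrate the tensor shapes and the structure of the two constraints.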
Related papers
- Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection [48.429555904690595]
We introduce spatially decoupled DETR, which includes a task-aware query generation module and a disentangled feature learning process.
We demonstrate that our approach achieves a significant improvement on the MSCOCO dataset compared to previous work.
arXiv Detail & Related papers (2023-10-24T15:54:11Z)
- Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization [31.039698757869974]
Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision.
Previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope.
We propose a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation.
arXiv Detail & Related papers (2023-09-04T03:20:31Z)
- Rethinking the Localization in Weakly Supervised Object Localization [51.29084037301646]
Weakly supervised object localization (WSOL) is one of the most popular and challenging tasks in computer vision.
Recently, dividing WSOL into two parts (class-agnostic object localization and object classification) has become the state-of-the-art pipeline for this task.
We propose to replace SCR with a binary-class detector (BCD) for localizing multiple objects, where the detector is trained by discriminating the foreground and background.
arXiv Detail & Related papers (2023-08-11T14:38:51Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment [53.401889855278704]
Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples.
We propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local to local (L2L) similarity metric.
Experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-10-04T07:54:40Z)
- Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration [20.322494442959762]
Weakly Supervised Object Localization (WSOL) has attracted much attention because of its low annotation cost in real applications.
We introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, incorporating semantic similarities of patch tokens and their spatial relationships into a unified diffusion model.
SCM is designed as an external module of the Transformer and can be removed during inference to reduce the computation cost.
arXiv Detail & Related papers (2022-07-21T12:37:15Z)
- ViTOL: Vision Transformer for Weakly Supervised Object Localization [0.735996217853436]
Weakly supervised object localization (WSOL) aims at predicting object locations in an image using only image-level category labels.
Common challenges that image classification models encounter when localizing objects are: (a) they tend to look at the most discriminative features in an image, which confines the localization map to a very small region; and (b) the localization maps are class-agnostic, so the models highlight objects of multiple classes in the same image.
arXiv Detail & Related papers (2022-04-14T06:16:34Z)
- TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization [112.46381729542658]
Weakly supervised object localization (WSOL) is a challenging problem when given image category labels.
We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction.
arXiv Detail & Related papers (2021-03-27T09:43:16Z)
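For context, the coupling idea named in the TS-CAM title can be sketched as combining a class-agnostic attention map with per-patch semantic maps. The shapes and the element-wise product below are illustrative assumptions drawn from the one-line summary, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative shapes (assumed): a 14x14 patch grid and 10 classes.
H = W = 14
C = 10

# Semantic-aware maps: per-patch class scores (in practice produced by a
# head on the patch embeddings, omitted here). Attention map: class-agnostic,
# e.g. the class token's attention over patches averaged across heads/layers.
semantic_maps = rng.random((C, H, W))
attention_map = rng.random((H, W))

# Token semantic coupling (sketch): an element-wise product combines
# "where an object is" (attention) with "what it is" (semantics).
ts_cam = semantic_maps * attention_map[None, :, :]  # (C, H, W)

# Localization map for the top-scoring class, min-max normalized to [0, 1].
cls = ts_cam.reshape(C, -1).sum(axis=1).argmax()
loc = ts_cam[cls]
loc = (loc - loc.min()) / (loc.max() - loc.min() + 1e-8)
```

The random arrays stand in for real transformer outputs; only the shapes and the coupling-then-normalize structure are the point of the sketch.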
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.