Weakly Supervised Object Localization via Transformer with Implicit
Spatial Calibration
- URL: http://arxiv.org/abs/2207.10447v1
- Date: Thu, 21 Jul 2022 12:37:15 GMT
- Title: Weakly Supervised Object Localization via Transformer with Implicit
Spatial Calibration
- Authors: Haotian Bai and Ruimao Zhang and Jiong Wang and Xiang Wan
- Abstract summary: Weakly Supervised Object Localization (WSOL) has attracted much attention because of its low annotation cost in real applications.
We introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, incorporating semantic similarities of patch tokens and their spatial relationships into a unified diffusion model.
SCM is designed as an external module of the Transformer and can be removed during inference to reduce the computation cost.
- Score: 20.322494442959762
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly Supervised Object Localization (WSOL), which aims to localize objects
by only using image-level labels, has attracted much attention because of its
low annotation cost in real applications. Recent studies leverage the advantage
of self-attention in visual Transformers for long-range dependency modeling to
re-activate semantic regions, aiming to avoid the partial activation of
traditional class activation mapping (CAM). However, long-range modeling in
Transformers neglects the inherent spatial coherence of the object and often
diffuses semantic-aware regions far from the object boundary, making
localization results significantly larger or smaller than the object. To
address this issue, we
introduce a simple yet effective Spatial Calibration Module (SCM) for accurate
WSOL, incorporating semantic similarities of patch tokens and their spatial
relationships into a unified diffusion model. Specifically, we introduce a
learnable parameter to dynamically adjust the semantic correlations and spatial
context intensities for effective information propagation. In practice, SCM is
designed as an external module of the Transformer and can be removed during
inference to reduce the computation cost. The object-sensitive localization
ability is implicitly embedded into the Transformer encoder through
optimization in the training phase. This enables the generated attention maps
to capture sharper object boundaries and filter out object-irrelevant
background areas. Extensive experimental results demonstrate the effectiveness
of the proposed method, which significantly outperforms its counterpart TS-CAM
on both CUB-200 and ImageNet-1K benchmarks. The code is available at
https://github.com/164140757/SCM.
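The abstract describes SCM as diffusing attention over a graph that mixes the semantic similarity of patch tokens with their spatial relationships, weighted by a learnable parameter. The following is a minimal illustrative sketch of that idea in numpy, not the authors' implementation (see the linked repository for that); the function names, the Gaussian spatial affinity, the cosine semantic affinity, and the fixed step count are all simplifying assumptions.

```python
import numpy as np

def gaussian_spatial_adjacency(h, w, sigma=1.0):
    # Pairwise Gaussian affinity between positions on the h x w patch grid,
    # encoding the spatial coherence prior (nearby patches are related).
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (N, 2)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def calibrate_attention(tokens, attn, h, w, lam=0.5, steps=3, sigma=1.0):
    """Diffuse an attention map over a mixed semantic/spatial graph.

    tokens: (N, d) patch embeddings; attn: (N,) raw attention scores;
    lam: mixing weight between semantic and spatial affinity (the paper
    makes the analogous quantity learnable; here it is a fixed scalar).
    """
    # Semantic affinity: cosine similarity of patch tokens, clipped to >= 0.
    z = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sem = np.clip(z @ z.T, 0.0, None)
    spa = gaussian_spatial_adjacency(h, w, sigma)
    A = lam * sem + (1.0 - lam) * spa
    A = A / A.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    out = attn.astype(float).copy()
    for _ in range(steps):  # a few diffusion steps propagate activation
        out = A @ out
    return out

# Toy usage: activation concentrated on one discriminative patch spreads
# to semantically and spatially related patches after calibration.
rng = np.random.default_rng(0)
h = w = 4
toks = rng.standard_normal((h * w, 8))
raw = np.zeros(h * w)
raw[5] = 1.0
cal = calibrate_attention(toks, raw, h, w)
```

Because the transition matrix is dense and strictly positive, even a few diffusion steps redistribute the single-patch activation across the whole grid, which mirrors the abstract's claim that SCM counteracts partial activation while the spatial term keeps the spread anchored near the object.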
Related papers
- ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z) - Multiscale Vision Transformer With Deep Clustering-Guided Refinement for
Weakly Supervised Object Localization [4.300577895958228]
This work addresses the task of weakly-supervised object localization.
It comprises multiple object localization transformers that extract patch embeddings across various scales.
We introduce a deep clustering-guided refinement method that further enhances localization accuracy.
arXiv Detail & Related papers (2023-12-15T07:46:44Z) - Background Activation Suppression for Weakly Supervised Object
Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z) - Semantic-Constraint Matching Transformer for Weakly Supervised Object
Localization [31.039698757869974]
Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision.
Previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope.
We propose a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation.
arXiv Detail & Related papers (2023-09-04T03:20:31Z) - LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z) - Spatial-Aware Token for Weakly Supervised Object Localization [137.0570026552845]
We propose a task-specific spatial-aware token to condition localization in a weakly supervised manner.
Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc.
arXiv Detail & Related papers (2023-03-18T15:38:17Z) - Robust Change Detection Based on Neural Descriptor Fields [53.111397800478294]
We develop an object-level online change detection approach that is robust to partially overlapping observations and noisy localization results.
By associating objects via shape code similarity and comparing local object-neighbor spatial layout, our proposed approach demonstrates robustness to low observation overlap and localization noises.
arXiv Detail & Related papers (2022-08-01T17:45:36Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Weakly Supervised Object Localization as Domain Adaption [19.854125742336688]
Weakly supervised object localization (WSOL) focuses on localizing objects only with the supervision of image-level classification labels.
Most previous WSOL methods follow the classification activation map (CAM) that localizes objects based on the classification structure with the multi-instance learning (MIL) mechanism.
This work provides a novel perspective that models WSOL as a domain adaption (DA) task, where the score estimator trained on the source/image domain is tested on the target/pixel domain to locate objects.
arXiv Detail & Related papers (2022-03-03T13:50:22Z) - LCTR: On Awakening the Local Continuity of Transformer for Weakly
Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer solely by using image-level labels.
We propose a novel framework built upon the transformer, termed LCTR, which targets at enhancing the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.