MOST: Multiple Object localization with Self-supervised Transformers for object discovery
- URL: http://arxiv.org/abs/2304.05387v2
- Date: Sat, 26 Aug 2023 23:25:27 GMT
- Title: MOST: Multiple Object localization with Self-supervised Transformers for object discovery
- Authors: Sai Saketh Rambhatla, Ishan Misra, Rama Chellappa, Abhinav Shrivastava
- Abstract summary: We present Multiple Object localization with Self-supervised Transformers (MOST).
MOST uses features of transformers trained with self-supervised learning to localize multiple objects in real-world images.
We show MOST can be used for self-supervised pre-training of object detectors, and that it yields consistent improvements on fully- and semi-supervised object detection and on unsupervised region proposal generation.
- Score: 97.47075050779085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We tackle the challenging task of unsupervised object localization in this
work. Recently, transformers trained with self-supervised learning have been
shown to exhibit object localization properties without being trained for this
task. In this work, we present Multiple Object localization with
Self-supervised Transformers (MOST) that uses features of transformers trained
using self-supervised learning to localize multiple objects in real world
images. MOST analyzes the similarity maps of the features using box counting; a
fractal analysis tool to identify tokens lying on foreground patches. The
identified tokens are then clustered together, and tokens of each cluster are
used to generate bounding boxes on foreground regions. Unlike recent
state-of-the-art object localization methods, MOST can localize multiple
objects per image and outperforms SOTA algorithms on several object
localization and discovery benchmarks on PASCAL-VOC 07, 12 and COCO20k
datasets. Additionally, we show that MOST can be used for self-supervised
pre-training of object detectors, and yields consistent improvements on
fully- and semi-supervised object detection and on unsupervised region
proposal generation.
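To make the pipeline above concrete, the following is a minimal NumPy/SciPy sketch of the three stages the abstract names: binarize each token's similarity map, keep tokens whose maps pass a box-counting compactness test, and group the retained tokens into per-object boxes. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `sim_thresh` and `dim_thresh` values, and the use of connected components in place of MOST's actual clustering step are all hypothetical.

```python
# Illustrative sketch only: thresholds, names, and the connected-component
# grouping are assumptions, not the MOST reference implementation.
import numpy as np
from scipy.ndimage import label


def box_counting_dimension(mask, sizes=(1, 2, 4, 8)):
    """Estimate the box-counting (fractal) dimension of a 2D binary map
    by counting occupied s-by-s boxes N(s) at each grid size s and fitting
    log N(s) = -d * log s + c; the slope magnitude d is the estimate."""
    counts = []
    for s in sizes:
        h, w = mask.shape
        padded = np.pad(mask, ((0, (-h) % s), (0, (-w) % s)))
        blocks = padded.reshape(padded.shape[0] // s, s,
                                padded.shape[1] // s, s)
        counts.append(max(int((blocks.sum(axis=(1, 3)) > 0).sum()), 1))
    return -np.polyfit(np.log(sizes), np.log(counts), 1)[0]


def discover_boxes(tokens, grid_hw, sim_thresh=0.5, dim_thresh=1.5):
    """Toy multi-object discovery from the (N, D) patch tokens of a
    self-supervised ViT, where N == grid_hw[0] * grid_hw[1]."""
    h, w = grid_hw
    f = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = f @ f.T  # token-to-token cosine similarity, shape (N, N)
    keep = np.zeros(len(f), dtype=bool)
    for i in range(len(f)):
        # Binarize token i's similarity map; retain the token only if the
        # map is spatially localized under the box-counting test.
        m = sim[i].reshape(h, w) > sim_thresh
        keep[i] = m.any() and box_counting_dimension(m) < dim_thresh
    # Group retained tokens into connected clusters (a stand-in for MOST's
    # clustering step) and return one tight box per cluster.
    labeled, n = label(keep.reshape(h, w))
    boxes = []
    for k in range(1, n + 1):
        ys, xs = np.nonzero(labeled == k)
        boxes.append((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
    return boxes  # (x0, y0, x1, y1) in patch-grid coordinates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(14 * 14, 384))  # stand-in for 14x14 ViT tokens
    print(discover_boxes(feats, (14, 14)))   # smoke test on random features
```

The direction of the `dim_thresh` inequality reflects one guess about the mechanics: a token on a large background region (sky, road) correlates with much of the image, so its binarized map blankets the grid and the fitted dimension approaches 2, while a token on an object yields a localized map that occupies few boxes at coarse grid sizes and fits a smaller slope. MOST's actual selection rule may differ in detail.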
Related papers
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z)
- Multiscale Vision Transformer With Deep Clustering-Guided Refinement for Weakly Supervised Object Localization [4.300577895958228]
This work addresses the task of weakly-supervised object localization.
The proposed framework comprises multiple object localization transformers that extract patch embeddings across various scales.
We introduce a deep clustering-guided refinement method that further enhances localization accuracy.
arXiv Detail & Related papers (2023-12-15T07:46:44Z)
- Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- ECEA: Extensible Co-Existing Attention for Few-Shot Object Detection [52.16237548064387]
Few-shot object detection (FSOD) identifies objects from extremely few annotated samples.
Most existing FSOD methods apply a two-stage learning paradigm that transfers knowledge learned from abundant base classes to assist few-shot detectors by learning global features.
We propose an Extensible Co-Existing Attention (ECEA) module to enable the model to infer the global object according to the local parts.
arXiv Detail & Related papers (2023-09-15T06:55:43Z)
- Constrained Sampling for Class-Agnostic Weakly Supervised Object Localization [10.542859578763068]
Self-supervised vision transformers can generate accurate localization maps of the objects in an image.
We propose leveraging the multiple maps generated by the different transformer heads to acquire pseudo-labels for training a weakly-supervised object localization model.
arXiv Detail & Related papers (2022-09-09T19:58:38Z)
- Weakly Supervised Object Localization as Domain Adaption [19.854125742336688]
Weakly supervised object localization (WSOL) focuses on localizing objects only with the supervision of image-level classification labels.
Most previous WSOL methods follow the class activation map (CAM) paradigm, which localizes objects based on the classification structure with a multi-instance learning (MIL) mechanism.
This work provides a novel perspective that models WSOL as a domain adaption (DA) task, where the score estimator trained on the source/image domain is tested on the target/pixel domain to locate objects.
arXiv Detail & Related papers (2022-03-03T13:50:22Z)
- LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer solely from image-level labels.
We propose a novel framework built upon the transformer, termed LCTR, which aims to enhance the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
- Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z)
- Robust Object Detection via Instance-Level Temporal Cycle Confusion [89.1027433760578]
We study the effectiveness of auxiliary self-supervised tasks to improve the out-of-distribution generalization of object detectors.
Inspired by the principle of maximum entropy, we introduce a novel self-supervised task, instance-level temporal cycle confusion (CycConf).
For each object, the task is to find the most different object proposals in the adjacent frame of a video and then cycle back to itself for self-supervision (a toy sketch of this cycle follows the list).
arXiv Detail & Related papers (2021-04-16T21:35:08Z)
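Since the CycConf entry above describes a concrete protocol, a toy rendering of the cycle may help. Everything below is inferred from the one-sentence summary only: the names (`cycle_confusion_loss`, `props_t`) and the cosine-similarity/softmax formulation are assumptions, and a real implementation would compute this on detector proposal features inside an autograd framework rather than in NumPy.

```python
# Toy sketch of an instance-level temporal cycle, inferred from the summary
# above; names and the similarity/softmax formulation are assumptions.
import numpy as np


def _softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cycle_confusion_loss(props_t, props_t1):
    """Hop from each frame-t proposal to its most DIFFERENT proposal in
    frame t+1, softly associate back to frame t, and penalize cycles that
    fail to return to their starting proposal."""
    a = props_t / np.linalg.norm(props_t, axis=1, keepdims=True)
    b = props_t1 / np.linalg.norm(props_t1, axis=1, keepdims=True)
    sim = a @ b.T                        # (N_t, N_t1) cosine similarity
    j = sim.argmin(axis=1)               # forward hop: most different proposal
    back = _softmax(b[j] @ a.T, axis=1)  # backward hop: soft match to frame t
    idx = np.arange(len(a))
    return -np.log(back[idx, idx] + 1e-9).mean()


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    p_t = rng.normal(size=(5, 128))                # proposal embeddings, frame t
    p_t1 = p_t + 0.05 * rng.normal(size=(5, 128))  # slightly shifted, frame t+1
    print(cycle_confusion_loss(p_t, p_t1))
```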
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.