HASSOD: Hierarchical Adaptive Self-Supervised Object Detection
- URL: http://arxiv.org/abs/2402.03311v1
- Date: Mon, 5 Feb 2024 18:59:41 GMT
- Title: HASSOD: Hierarchical Adaptive Self-Supervised Object Detection
- Authors: Shengcao Cao, Dhiraj Joshi, Liang-Yan Gui, Yu-Xiong Wang
- Abstract summary: Hierarchical Adaptive Self-Supervised Object Detection (HASSOD) is a novel approach that learns to detect objects and understand their compositions without human supervision.
We employ a hierarchical adaptive clustering strategy to group regions into object masks based on self-supervised visual representations.
HASSOD identifies the hierarchical levels of objects in terms of composition, by analyzing coverage relations between masks and constructing tree structures.
- Score: 29.776467276826747
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The human visual perception system demonstrates exceptional capabilities in
learning without explicit supervision and understanding the part-to-whole
composition of objects. Drawing inspiration from these two abilities, we
propose Hierarchical Adaptive Self-Supervised Object Detection (HASSOD), a
novel approach that learns to detect objects and understand their compositions
without human supervision. HASSOD employs a hierarchical adaptive clustering
strategy to group regions into object masks based on self-supervised visual
representations, adaptively determining the number of objects per image.
Furthermore, HASSOD identifies the hierarchical levels of objects in terms of
composition, by analyzing coverage relations between masks and constructing
tree structures. This additional self-supervised learning task leads to
improved detection performance and enhanced interpretability. Lastly, we
abandon the inefficient multi-round self-training process utilized in prior
methods and instead adapt the Mean Teacher framework from semi-supervised
learning, which leads to a smoother and more efficient training process.
Through extensive experiments on prevalent image datasets, we demonstrate the
superiority of HASSOD over existing methods, thereby advancing the state of the
art in self-supervised object detection. Notably, we improve Mask AR from 20.2
to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B. Project page:
https://HASSOD-NeurIPS23.github.io.
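The coverage-based tree construction described in the abstract can be pictured with a short sketch. The following is a minimal illustration, not the authors' code: it assumes coverage is measured as the fraction of a smaller mask that lies inside a larger one, and that each mask is attached to the smallest mask covering it above a threshold; the function names and the threshold value are illustrative only.
```python
import numpy as np

def build_coverage_tree(masks, tau=0.9):
    """masks: list of boolean HxW arrays. Returns a parent index per mask (-1 = root)."""
    areas = [int(m.sum()) for m in masks]
    parents = [-1] * len(masks)
    for i in range(len(masks)):
        best, best_area = -1, None
        for j in range(len(masks)):
            if j == i or areas[j] <= areas[i]:
                continue
            # coverage of mask i by mask j: |i ∩ j| / |i|
            coverage = np.logical_and(masks[i], masks[j]).sum() / max(areas[i], 1)
            # candidate parent: the smallest mask that sufficiently covers mask i
            if coverage >= tau and (best_area is None or areas[j] < best_area):
                best, best_area = j, areas[j]
        parents[i] = best
    return parents

def hierarchy_levels(parents):
    """Hierarchical level of each mask = depth in the tree (0 for whole objects)."""
    def depth(i):
        return 0 if parents[i] == -1 else 1 + depth(parents[i])
    return [depth(i) for i in range(len(parents))]
```
Under these assumptions, whole objects end up as tree roots, while parts and subparts appear at increasing depth.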
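The Mean Teacher adaptation mentioned in the abstract maintains a teacher model whose weights are an exponential moving average (EMA) of the student's, with the teacher providing pseudo-labels for the student. A minimal sketch of the EMA step, assuming a standard momentum value (the paper's exact schedule may differ):
```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    """teacher <- momentum * teacher + (1 - momentum) * student, parameter-wise."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)  # keep running statistics (e.g., BatchNorm) in sync
```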
Related papers
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
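As a rough illustration of PCA-based localization (not the authors' implementation): project per-patch features onto their first principal component and threshold the projection to obtain a coarse foreground mask. The shapes and the thresholding rule below are assumptions made for the sketch.
```python
import numpy as np

def pca_localize(features):
    """features: (H, W, C) patch-level features -> (H, W) boolean foreground mask."""
    h, w, c = features.shape
    flat = features.reshape(-1, c)
    flat = flat - flat.mean(axis=0, keepdims=True)   # center the features
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    projection = flat @ vt[0]                        # first principal component
    # the sign of the component is arbitrary; one side is taken as foreground here
    mask = projection > projection.mean()
    return mask.reshape(h, w)
```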
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning [66.00600682711995]
Human object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building-block for many vision tasks.
One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only.
This is inherently challenging due to ambiguous human-object associations, the large search space of possible HOIs, and a highly noisy training signal.
We develop a CLIP-guided HOI representation capable of incorporating the prior knowledge at both image level and HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations.
arXiv Detail & Related papers (2023-03-02T14:41:31Z)
- Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame as an unordered set of proposal feature vectors, a representation that is natural for object detection and tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z)
- The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming state-of-the-art methods that require object detection and human pose estimation by a clear margin.
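The classifier initialization described above can be sketched as follows; the text embeddings, dimensions, and normalization are placeholders rather than the paper's exact setup.
```python
import torch
import torch.nn as nn

def init_head_from_language(class_embeddings: torch.Tensor) -> nn.Linear:
    """class_embeddings: (num_classes, dim) language embeddings, one per HOI class."""
    num_classes, dim = class_embeddings.shape
    head = nn.Linear(dim, num_classes, bias=True)
    with torch.no_grad():
        # L2-normalize so semantically similar classes start with correlated weights
        weights = class_embeddings / class_embeddings.norm(dim=-1, keepdim=True)
        head.weight.copy_(weights)
        head.bias.zero_()
    return head
```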
arXiv Detail & Related papers (2022-03-10T23:35:00Z)
- Unsupervised Pretraining for Object Detection by Patch Reidentification [72.75287435882798]
Unsupervised representation learning achieves promising performance in pre-training representations for object detectors.
This work proposes a simple yet effective representation learning method for object detection, named patch re-identification (Re-ID).
Our method significantly outperforms its counterparts on COCO in all settings, such as different training iterations and data percentages.
arXiv Detail & Related papers (2021-03-08T15:13:59Z)
- Using Feature Alignment Can Improve Clean Average Precision and Adversarial Robustness in Object Detection [11.674302325688862]
We propose that feature alignment of intermediate layers can improve clean AP and robustness in object detection.
We conduct extensive experiments on PASCAL VOC and MS-COCO datasets to verify the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-12-08T11:54:39Z)
- Unsupervised Image Classification for Deep Representation Learning [42.09716669386924]
We propose an unsupervised image classification framework without using embedding clustering.
Experiments on the ImageNet dataset have been conducted to demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2020-06-20T02:57:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.