Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction
- URL: http://arxiv.org/abs/2311.04834v1
- Date: Wed, 8 Nov 2023 16:59:26 GMT
- Title: Self-Supervised Learning for Visual Relationship Detection through Masked Bounding Box Reconstruction
- Authors: Zacharias Anastasakis, Dimitrios Mallis, Markos Diomataris, George Alexandridis, Stefanos Kollias, Vassilis Pitsikalis
- Abstract summary: We present a novel self-supervised approach for representation learning, particularly for the task of Visual Relationship Detection (VRD).
Motivated by the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding Box Reconstruction (MBBR).
- Score: 6.798515070856465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel self-supervised approach for representation learning,
particularly for the task of Visual Relationship Detection (VRD). Motivated by
the effectiveness of Masked Image Modeling (MIM), we propose Masked Bounding
Box Reconstruction (MBBR), a variation of MIM where a percentage of the
entities/objects within a scene are masked and subsequently reconstructed based
on the unmasked objects. The core idea is that, through object-level masked
modeling, the network learns context-aware representations that capture the
interaction of objects within a scene and thus are highly predictive of visual
object relationships. We extensively evaluate learned representations, both
qualitatively and quantitatively, in a few-shot setting and demonstrate the
efficacy of MBBR for learning robust visual representations, particularly
tailored for VRD. The proposed method is able to surpass state-of-the-art VRD
methods on the Predicate Detection (PredDet) evaluation setting, using only a
few annotated samples. We make our code available at
https://github.com/deeplab-ai/SelfSupervisedVRD.
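As a rough illustration of the object-level masked modeling idea described above (a minimal sketch, not the authors' exact architecture; all module names and hyperparameters are assumptions), one can replace a fraction of per-object features with a learned mask token and train a small transformer to reconstruct them from the unmasked objects in the scene:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedBoxReconstructor(nn.Module):
    """Illustrative object-level masked modeling: replace a fraction of
    per-object features with a learned mask token and reconstruct them
    from the remaining (unmasked) objects in the scene."""

    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, obj_feats, mask_ratio=0.3):
        # obj_feats: (batch, n_objects, feat_dim), e.g. RoI features of
        # the detected bounding boxes in each scene.
        b, n, _ = obj_feats.shape
        n_mask = max(1, int(mask_ratio * n))
        idx = torch.rand(b, n).argsort(dim=1)[:, :n_mask]  # random objects
        mask = torch.zeros(b, n, dtype=torch.bool)
        mask.scatter_(1, idx, True)
        masked = torch.where(mask.unsqueeze(-1), self.mask_token, obj_feats)
        recon = self.encoder(masked)  # context from unmasked objects
        return F.mse_loss(recon[mask], obj_feats[mask])  # masked objects only

# toy usage: 8 scenes with 12 objects each
loss = MaskedBoxReconstructor()(torch.randn(8, 12, 256))
loss.backward()
```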
Related papers
- Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image.
We present two new SOD datasets "DUTS-MM" and "DUS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z)
- Attention-Guided Masked Autoencoders For Learning Image Representations [16.257915216763692]
Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks.
We propose to inform the reconstruction process through an attention-guided loss function.
Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE.
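The summary does not spell the loss out; one plausible reading (an assumption, not necessarily the paper's exact formulation) is to weight per-patch reconstruction error by normalized attention scores:

```python
import torch

def attention_weighted_recon_loss(pred, target, attn):
    """Per-patch MSE weighted by normalized attention scores, so salient
    patches dominate the reconstruction objective.
    pred, target: (batch, n_patches, dim); attn: (batch, n_patches) >= 0."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)        # (batch, n_patches)
    weights = attn / attn.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return (weights * per_patch).sum(dim=1).mean()
```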
arXiv Detail & Related papers (2024-02-23T08:11:25Z)
- CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
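The Siamese design is left abstract here; a loose illustration (an assumption, not CtxMIM's actual architecture) runs a shared encoder over two patch sets from the same image, with one branch predicting the other through a stop-gradient target:

```python
import torch
import torch.nn as nn

class SiamesePatchBranches(nn.Module):
    """Loose Siamese illustration: a shared encoder processes two patch
    sets from the same image; one branch predicts the other's features
    through a stop-gradient target."""

    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                     nn.Linear(dim, dim))
        self.predictor = nn.Linear(dim, dim)

    def forward(self, patches_a, patches_b):
        # patches_a, patches_b: (batch, n_patches, dim), two patch sets
        za = self.encoder(patches_a)
        with torch.no_grad():          # target branch receives no gradient
            zb = self.encoder(patches_b)
        return nn.functional.mse_loss(self.predictor(za), zb)
```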
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can mitigate the need of Vision Transformer networks for very large fully-annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
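A minimal sketch of what predicting masked codebook assignments could look like (assumed shapes, prototypes, and temperatures; not necessarily MOCA's exact objective): a student predicts the teacher's soft assignments over a set of prototype codes at masked patch positions:

```python
import torch
import torch.nn.functional as F

def masked_assignment_loss(student_feats, teacher_feats, prototypes,
                           masked_idx, tau_teacher=0.07, tau_student=0.1):
    """Student predicts the teacher's soft assignments over a codebook of
    prototypes, only at masked patch positions.
    *_feats: (n_patches, dim); prototypes: (n_codes, dim); masked_idx: (m,)."""
    protos = F.normalize(prototypes, dim=1)
    # teacher assignments (assumed detached, e.g. from a momentum encoder)
    t = F.softmax(F.normalize(teacher_feats, dim=1) @ protos.t()
                  / tau_teacher, dim=1)
    s = F.log_softmax(F.normalize(student_feats, dim=1) @ protos.t()
                      / tau_student, dim=1)
    return -(t[masked_idx] * s[masked_idx]).sum(dim=1).mean()
```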
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
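The Gumbel-Softmax trick makes the discrete keep/mask decision differentiable, which is what allows a mask generator to be trained end-to-end; a minimal sketch (shapes and temperature are assumptions):

```python
import torch
import torch.nn.functional as F

def sample_patch_mask(logits, tau=1.0):
    """Differentiable keep/mask decision per patch via Gumbel-Softmax.
    logits: (batch, n_patches, 2), unnormalized scores for [keep, mask]."""
    # hard=True returns one-hot samples in the forward pass while keeping
    # soft gradients for the mask generator in the backward pass
    y = F.gumbel_softmax(logits, tau=tau, hard=True)   # (batch, n_patches, 2)
    return y[..., 1]  # 1.0 where the patch is masked
```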
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Understanding Self-Supervised Pretraining with Part-Aware Representation Learning [88.45460880824376]
We study whether self-supervised representation pretraining methods learn part-aware representations.
Results show that the fully-supervised model outperforms self-supervised models for object-level recognition.
arXiv Detail & Related papers (2023-01-27T18:58:42Z)
- i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
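Linear separability is typically probed by fitting a single linear layer on frozen features; a minimal sketch of such a probe (hyperparameters are assumptions, and this is not necessarily i-MAE's evaluation protocol):

```python
import torch
import torch.nn as nn

def linear_probe_accuracy(latents, labels, n_classes, epochs=100, lr=1e-2):
    """Fit a single linear layer on frozen latent features and report
    training accuracy as a crude linear-separability measure.
    latents: (n, dim) detached features; labels: (n,) class indices."""
    clf = nn.Linear(latents.size(1), n_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(clf(latents), labels).backward()
        opt.step()
    return (clf(latents).argmax(dim=1) == labels).float().mean().item()
```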
arXiv Detail & Related papers (2022-10-20T17:59:54Z)
- Object-wise Masked Autoencoders for Fast Pre-training [13.757095663704858]
We show that current masked image encoding models learn the underlying relationship between all objects in the whole scene, instead of a single object representation.
We introduce a novel object selection and division strategy to drop non-object patches for learning object-wise representations by selective reconstruction with interested region masks.
Experiments on four commonly-used datasets demonstrate the effectiveness of our model in reducing the compute cost by 72% while achieving competitive performance.
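One simple way to realize dropping non-object patches (an illustrative assumption, not the paper's exact selection strategy) is to keep only the patch-grid cells that overlap an object bounding box:

```python
import torch

def object_patch_mask(boxes, img_size=224, patch=16):
    """Boolean mask over the ViT patch grid: True where a patch overlaps
    any object bounding box. boxes: (n_boxes, 4) as (x1, y1, x2, y2) px."""
    g = img_size // patch
    keep = torch.zeros(g, g, dtype=torch.bool)
    for x1, y1, x2, y2 in boxes.tolist():
        c0, r0 = max(0, int(x1) // patch), max(0, int(y1) // patch)
        c1, r1 = min(g - 1, int(x2) // patch), min(g - 1, int(y2) // patch)
        keep[r0:r1 + 1, c0:c1 + 1] = True
    return keep.flatten()  # (g*g,) in row-major patch order
```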
arXiv Detail & Related papers (2022-05-28T05:13:45Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
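Region-level contrast can be written as an InfoNCE loss over matched region features from two views; a minimal sketch (temperature and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def region_infonce(q, k, temperature=0.2):
    """Region-level InfoNCE: each query region feature should match its
    counterpart region from the other view, against all other regions.
    q, k: (n_regions, dim) matched region features from two views."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                   # (n, n) similarities
    labels = torch.arange(q.size(0), device=q.device)  # diagonal positives
    return F.cross_entropy(logits, labels)
```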
arXiv Detail & Related papers (2021-08-18T02:50:33Z)
- Efficient Object-Level Visual Context Modeling for Multimodal Machine Translation: Masking Irrelevant Objects Helps Grounding [25.590409802797538]
We propose an object-level visual context modeling framework (OVC) to efficiently capture and explore visual information for multimodal machine translation.
OVC encourages MMT to ground translation on desirable visual objects by masking irrelevant objects in the visual modality.
Experiments on MMT datasets demonstrate that the proposed OVC model outperforms state-of-the-art MMT models.
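One plausible realization of masking irrelevant objects (an assumption, not OVC's exact mechanism) is to keep only the object features most similar to the sentence representation:

```python
import torch
import torch.nn.functional as F

def mask_irrelevant_objects(obj_feats, text_feat, keep_ratio=0.5):
    """Zero out (mask) the object features least similar to the sentence
    representation, keeping only translation-relevant objects.
    obj_feats: (n_objects, dim); text_feat: (dim,)."""
    sims = F.cosine_similarity(obj_feats, text_feat.unsqueeze(0), dim=1)
    n_keep = max(1, int(keep_ratio * obj_feats.size(0)))
    keep = sims.topk(n_keep).indices
    masked = torch.zeros_like(obj_feats)
    masked[keep] = obj_feats[keep]
    return masked
```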
arXiv Detail & Related papers (2020-12-18T11:10:00Z)
- Visual Relationship Detection with Visual-Linguistic Knowledge from Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Visual-Linguistic Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.