Neural Congealing: Aligning Images to a Joint Semantic Atlas
- URL: http://arxiv.org/abs/2302.03956v1
- Date: Wed, 8 Feb 2023 09:26:22 GMT
- Title: Neural Congealing: Aligning Images to a Joint Semantic Atlas
- Authors: Dolev Ofri-Amar, Michal Geyer, Yoni Kasten, Tali Dekel
- Abstract summary: We present a zero-shot self-supervised framework for aligning semantically-common content across a set of images.
Our approach harnesses the power of pre-trained DINO-ViT features to learn a joint semantic atlas and dense mappings from the atlas to each input image.
We show that our method performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.
- Score: 14.348512536556413
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Neural Congealing -- a zero-shot self-supervised framework for
detecting and jointly aligning semantically-common content across a given set
of images. Our approach harnesses the power of pre-trained DINO-ViT features to
learn: (i) a joint semantic atlas -- a 2D grid that captures the mode of
DINO-ViT features in the input set, and (ii) dense mappings from the unified
atlas to each of the input images. We derive a new robust self-supervised
framework that optimizes the atlas representation and mappings per image set,
requiring only a few real-world images as input without any additional input
information (e.g., segmentation masks). Notably, we design our losses and
training paradigm to account only for the shared content under severe
variations in appearance, pose, background clutter or other distracting
objects. We demonstrate results on a plethora of challenging image sets
including sets of mixed domains (e.g., aligning images depicting sculpture and
artwork of cats), sets depicting related yet different object categories (e.g.,
dogs and tigers), or domains for which large-scale training data is scarce
(e.g., coffee mugs). We thoroughly evaluate our method and show that our
test-time optimization approach performs favorably compared to a
state-of-the-art method that requires extensive training on large-scale
datasets.
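
The per-set, test-time optimization the abstract describes (a learnable 2D feature atlas plus per-image dense mappings, fit jointly to a small image set) can be sketched as follows. This is a minimal illustration assuming pre-extracted DINO-ViT patch features; the residual flow-field parameterization, loss weights, and all names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def congeal(features, steps=500, lr=1e-3):
    """Minimal sketch of per-set test-time optimization.

    features: (N, C, H, W) DINO-ViT patch features for the N input
    images, assumed pre-extracted with a frozen ViT.
    """
    N, C, H, W = features.shape

    # (i) Joint semantic atlas: a learnable 2D grid of features meant
    # to capture the mode of the set's features (the paper describes a
    # more careful initialization than random noise).
    atlas = torch.randn(1, C, H, W, requires_grad=True)

    # (ii) Dense mappings: here a per-image residual flow over an
    # identity sampling grid (the paper's mappings are richer).
    identity = torch.eye(2, 3).unsqueeze(0).repeat(N, 1, 1)
    base = F.affine_grid(identity, (N, C, H, W), align_corners=False)
    flow = torch.zeros(N, H, W, 2, requires_grad=True)

    opt = torch.optim.Adam([atlas, flow], lr=lr)
    for _ in range(steps):
        # Warp the atlas into each image's coordinate frame.
        warped = F.grid_sample(atlas.expand(N, -1, -1, -1),
                               base + flow, align_corners=False)
        # Match warped atlas features to image features; a faithful
        # implementation would down-weight non-shared content.
        loss = (1 - F.cosine_similarity(warped, features, dim=1)).mean()
        # Keep the mappings smooth with a simple flow penalty.
        loss = loss + 0.1 * flow.pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return atlas.detach(), flow.detach()
```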
Related papers
- Co-Segmentation without any Pixel-level Supervision with Application to Large-Scale Sketch Classification [3.3104978705632777]
We propose a novel method for object co-segmentation, i.e. pixel-level localization of a common object in a set of images.
The method achieves state-of-the-art performance among methods trained with the same level of supervision.
The benefits of the proposed co-segmentation method are further demonstrated in the task of large-scale sketch recognition.
arXiv Detail & Related papers (2024-10-17T14:16:45Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of the unbalanced distribution of interactions via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- ASIC: Aligning Sparse in-the-wild Image Collections [86.66498558225625]
We present a method for joint alignment of sparse in-the-wild image collections of an object category.
We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches.
Experiments on CUB and SPair-71k benchmarks demonstrate that our method can produce globally consistent and higher quality correspondences.
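
The matching step described above can be illustrated with mutual nearest neighbors over pre-extracted ViT patch descriptors. This is a generic sketch under those assumptions, not the ASIC code; the function name and feature shapes are hypothetical.

```python
import torch

def mutual_nn_matches(feat_a, feat_b):
    """Noisy sparse keypoint matches via mutual nearest neighbors.

    feat_a, feat_b: (Na, C) and (Nb, C) L2-normalized ViT patch
    descriptors for two images. Returns index pairs (i, j) where
    patch a_i and patch b_j each pick the other as nearest neighbor.
    """
    sim = feat_a @ feat_b.t()          # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(dim=1)          # best b-patch for each a-patch
    nn_ba = sim.argmax(dim=0)          # best a-patch for each b-patch
    idx_a = torch.arange(feat_a.shape[0])
    mutual = nn_ba[nn_ab] == idx_a     # a -> b -> back to the same a
    return idx_a[mutual], nn_ab[mutual]
```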
arXiv Detail & Related papers (2023-03-28T17:59:28Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
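
Image-level contrastive learning of the kind mentioned above is commonly instantiated as an InfoNCE objective over two augmented views of the same images. The following generic sketch assumes that standard form and is not this paper's exact loss.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """Generic image-level InfoNCE loss over two augmented views.

    z1, z2: (B, D) embeddings of the same B images under different
    augmentations; matching rows are positives, all other rows in the
    batch serve as negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                       # (B, B) similarities
    labels = torch.arange(z1.shape[0], device=z1.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```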
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Rectifying the Shortcut Learning of Background: Shared Object Concentration for Few-Shot Image Recognition [101.59989523028264]
Few-Shot image classification aims to utilize pretrained knowledge learned from a large-scale dataset to tackle a series of downstream classification tasks.
We propose COSOC, a novel Few-Shot Learning framework, to automatically extract foreground objects at both the pretraining and evaluation stages.
arXiv Detail & Related papers (2021-07-16T07:46:41Z)
- Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data.
Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously.
By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
arXiv Detail & Related papers (2021-04-26T19:23:36Z)
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-art results on all these settings, demonstrating its efficacy and generalizability.
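
A generic sketch of a cross-image co-attention block in the spirit of the summary above; the module structure, projection, and names are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Cross-image co-attention: each image's feature map attends to
    the other's, so shared object regions reinforce each other."""

    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Linear(channels, channels, bias=False)

    def forward(self, fa, fb):
        B, C, H, W = fa.shape
        a = fa.flatten(2).transpose(1, 2)              # (B, HW, C)
        b = fb.flatten(2).transpose(1, 2)              # (B, HW, C)
        affinity = self.proj(a) @ b.transpose(1, 2)    # (B, HW_a, HW_b)
        # Each location in A aggregates B's features weighted by
        # semantic affinity, and vice versa.
        a_att = affinity.softmax(dim=-1) @ b                        # (B, HW_a, C)
        b_att = affinity.softmax(dim=1).transpose(1, 2) @ a         # (B, HW_b, C)
        a_att = a_att.transpose(1, 2).reshape(B, C, H, W)
        b_att = b_att.transpose(1, 2).reshape(B, C, H, W)
        return a_att, b_att
```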
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
In contrast to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)