Emergent Correspondence from Image Diffusion
- URL: http://arxiv.org/abs/2306.03881v2
- Date: Wed, 6 Dec 2023 17:58:25 GMT
- Title: Emergent Correspondence from Image Diffusion
- Authors: Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, Bharath
Hariharan
- Abstract summary: We show that correspondence emerges in image diffusion models without any explicit supervision.
We propose a strategy to extract this implicit knowledge from diffusion networks as image features, namely DIffusion FeaTures (DIFT).
DIFT is able to outperform both weakly-supervised methods and competitive off-the-shelf features in identifying semantic, geometric, and temporal correspondences.
- Score: 56.29904609646015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Finding correspondences between images is a fundamental problem in computer
vision. In this paper, we show that correspondence emerges in image diffusion
models without any explicit supervision. We propose a simple strategy to
extract this implicit knowledge out of diffusion networks as image features,
namely DIffusion FeaTures (DIFT), and use them to establish correspondences
between real images. Without any additional fine-tuning or supervision on the
task-specific data or annotations, DIFT is able to outperform both
weakly-supervised methods and competitive off-the-shelf features in identifying
semantic, geometric, and temporal correspondences. Particularly for semantic
correspondence, DIFT from Stable Diffusion is able to outperform DINO and
OpenCLIP by 19 and 14 accuracy points respectively on the challenging SPair-71k
benchmark. It even outperforms the state-of-the-art supervised methods on 9 out
of 18 categories while remaining on par for the overall performance. Project
page: https://diffusionfeatures.github.io
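As a rough illustration of the idea, the sketch below extracts a DIFT-style feature map from Stable Diffusion (encode an image, add noise at a chosen timestep, run a single denoising U-Net pass, and hook an intermediate feature map), then matches a query pixel by cosine nearest neighbor. It assumes the Hugging Face `diffusers` library; the U-Net block index and timestep are illustrative choices, not necessarily the paper's exact settings (the reference implementation is on the project page).

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

feats = {}
# Hook an intermediate decoder (up) block; which block works best is an
# empirical choice.
handle = pipe.unet.up_blocks[1].register_forward_hook(
    lambda mod, inp, out: feats.update(map=out)
)

@torch.no_grad()
def dift(image, t=261, prompt=""):
    """image: (1, 3, 512, 512) float16 CUDA tensor scaled to [-1, 1]."""
    latents = pipe.vae.encode(image).latent_dist.mode()
    latents = latents * pipe.vae.config.scaling_factor
    t = torch.tensor([t], device=latents.device)
    # Add noise at timestep t, then run one denoising U-Net pass.
    noisy = pipe.scheduler.add_noise(latents, torch.randn_like(latents), t)
    ids = pipe.tokenizer(prompt, return_tensors="pt").input_ids.to(latents.device)
    emb = pipe.text_encoder(ids)[0]
    pipe.unet(noisy, t, encoder_hidden_states=emb)
    return feats["map"]  # (1, C, h, w) intermediate feature map

def match(feat_a, feat_b, y, x):
    """Nearest-neighbor match for source location (y, x) in feature space."""
    src = F.normalize(feat_a[0, :, y, x], dim=0)   # (C,)
    tgt = F.normalize(feat_b[0], dim=0)            # (C, h, w)
    sim = torch.einsum("c,chw->hw", src, tgt)      # cosine similarity map
    idx = sim.flatten().argmax()
    return divmod(idx.item(), sim.shape[1])        # (y*, x*) in feature grid
```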
Related papers
- Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks yielding error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z)
- Implicit and Explicit Language Guidance for Diffusion-based Visual Perception [42.71751651417168]
Text-to-image diffusion models can generate high-quality images with rich texture and reasonable structure under different text prompts.
We propose an implicit and explicit language guidance framework for diffusion-based perception, named IEDP.
Our IEDP achieves promising performance on two typical perception tasks, including semantic segmentation and depth estimation.
arXiv Detail & Related papers (2024-04-11T09:39:58Z)
- Unsupervised Semantic Correspondence Using Stable Diffusion [27.355330079806027]
We show that one can leverage the semantic knowledge inside diffusion models to find semantic correspondences.
We optimize the prompt embeddings of these models for maximum attention on the regions of interest.
We significantly outperform any existing weakly supervised or unsupervised method on the PF-Willow, CUB-200 and SPair-71k datasets.
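A conceptual sketch of this prompt-embedding optimization is below. It assumes a differentiable helper `attn_map_fn` (hypothetical, e.g. built from a `diffusers` attention-processor hook) that returns the optimized token's cross-attention map over the image; it is not the authors' implementation, and all names and hyperparameters here are illustrative.

```python
import torch

def optimize_prompt_embedding(attn_map_fn, noisy_latents, t, query_yx,
                              dim=768, steps=100, lr=1e-2):
    # Learnable stand-in for a single prompt-token embedding.
    emb = torch.randn(1, 1, dim, requires_grad=True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        attn = attn_map_fn(noisy_latents, t, emb)  # (H, W) cross-attention map
        # Maximize the attention mass falling on the query location.
        loss = -torch.log(attn[query_yx] / attn.sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()

# At test time, applying the optimized embedding to a second image and
# taking the argmax of its attention map gives the corresponding location.
```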
arXiv Detail & Related papers (2023-05-24T21:34:34Z)
- A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence [83.90531416914884]
We exploit Stable Diffusion features for semantic and dense correspondence.
With simple post-processing, SD features can perform quantitatively on par with SOTA representations.
We show that these correspondences can enable interesting applications such as instance swapping in two images.
arXiv Detail & Related papers (2023-05-24T16:59:26Z)
- Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence [88.00004819064672]
Diffusion Hyperfeatures is a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors.
Our method achieves superior performance on the SPair-71k real image benchmark.
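A minimal sketch of such an aggregation step follows: each source map gets a 1x1 projection into a shared descriptor space, is resized to a common grid, and is mixed with learned weights. The layer sizes and mixing scheme are assumptions for illustration, not the paper's exact aggregation network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperfeatureAggregator(nn.Module):
    def __init__(self, channel_dims, out_dim=384, out_size=64):
        super().__init__()
        self.out_size = out_size
        # One 1x1 projection per source map into a shared descriptor space.
        self.projs = nn.ModuleList(nn.Conv2d(c, out_dim, 1) for c in channel_dims)
        # Learned mixing weights over maps (uniform at initialization).
        self.mix = nn.Parameter(torch.zeros(len(channel_dims)))

    def forward(self, maps):
        # maps: list of (B, C_i, H_i, W_i) feature maps taken from
        # different U-Net layers and diffusion timesteps.
        w = self.mix.softmax(dim=0)
        out = 0.0
        for wi, proj, m in zip(w, self.projs, maps):
            m = F.interpolate(proj(m), size=self.out_size,
                              mode="bilinear", align_corners=False)
            out = out + wi * m
        return out  # (B, out_dim, out_size, out_size) per-pixel descriptors
```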
arXiv Detail & Related papers (2023-05-23T17:58:05Z)
- TopicFM: Robust and Interpretable Topic-Assisted Feature Matching [8.314830611853168]
We propose an architecture for image matching which is efficient, robust, and interpretable.
We introduce a novel feature matching module called TopicFM which can roughly organize same spatial structure across images into a topic.
Our method performs matching only in co-visible regions, which reduces computation.
arXiv Detail & Related papers (2022-07-01T10:39:14Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
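A minimal sketch of the image-level contrastive component, using a standard InfoNCE loss; the batch construction and temperature are illustrative assumptions, not the paper's exact multi-level setup.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    # anchors, positives: (B, D) embeddings; row i of `positives` is the
    # positive for row i of `anchors`, all other rows act as negatives.
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)   # positives lie on the diagonal
```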
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent the image in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.