A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot
Semantic Correspondence
- URL: http://arxiv.org/abs/2305.15347v2
- Date: Tue, 28 Nov 2023 17:47:46 GMT
- Title: A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot
Semantic Correspondence
- Authors: Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera,
Varun Jampani, Deqing Sun, Ming-Hsuan Yang
- Abstract summary: We exploit Stable Diffusion features for semantic and dense correspondence.
With simple post-processing, SD features can perform quantitatively on par with SOTA representations.
We show that these correspondences can enable interesting applications such as instance swapping in two images.
- Score: 83.90531416914884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image diffusion models have made significant advances in generating
and editing high-quality images. As a result, numerous approaches have explored
the ability of diffusion model features to understand and process single images
for downstream tasks, e.g., classification, semantic segmentation, and
stylization. However, significantly less is known about what these features
reveal across multiple, different images and objects. In this work, we exploit
Stable Diffusion (SD) features for semantic and dense correspondence and
discover that with simple post-processing, SD features can perform
quantitatively on par with SOTA representations. Interestingly, the qualitative
analysis reveals that SD features have very different properties compared to
existing representation learning features, such as the recently released
DINOv2: while DINOv2 provides sparse but accurate matches, SD features provide
high-quality spatial information but sometimes inaccurate semantic matches. We
demonstrate that a simple fusion of these two features works surprisingly well,
and a zero-shot evaluation using nearest neighbors on these fused features
provides a significant performance gain over state-of-the-art methods on
benchmark datasets, e.g., SPair-71k, PF-Pascal, and TSS. We also show that
these correspondences can enable interesting applications such as instance
swapping in two images.
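The abstract describes fusing SD and DINOv2 features and then matching pixels zero-shot with nearest neighbors. A minimal sketch of that idea is below, assuming precomputed per-pixel feature maps; the function name, the L2-normalize-then-concatenate fusion, and all shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def fuse_and_match(sd_feats_a, dino_feats_a, sd_feats_b, dino_feats_b):
    """Fuse two feature types per image, then match each pixel of image A
    to a pixel of image B by nearest neighbor in the fused space.

    Each input is an (H*W, C) array of per-pixel features; the SD and
    DINO channel counts may differ. Returns, for each pixel of A, the
    index of its best match in B.
    """
    def l2norm(x):
        # Normalize rows so neither feature type dominates the similarity.
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    fused_a = np.concatenate([l2norm(sd_feats_a), l2norm(dino_feats_a)], axis=1)
    fused_b = np.concatenate([l2norm(sd_feats_b), l2norm(dino_feats_b)], axis=1)

    # Cosine-style similarity between every pixel of A and every pixel of B.
    sim = fused_a @ fused_b.T
    # Zero-shot correspondence: argmax over B for each pixel of A.
    return sim.argmax(axis=1)
```

In this toy form, matching an image against itself recovers the identity correspondence, which is a quick sanity check for the fusion step.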
Related papers
- Leveraging Semantic Cues from Foundation Vision Models for Enhanced Local Feature Correspondence [12.602194710071116]
This paper presents a new method that uses semantic cues from foundation vision model features to enhance local feature matching.
We present adapted versions of six existing descriptors, with an average increase in performance of 29% in camera localization.
arXiv Detail & Related papers (2024-10-12T13:45:26Z)
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, exceeding the second-place method by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence [88.00004819064672]
Diffusion Hyperfeatures is a framework for consolidating multi-scale and multi-timestep feature maps into per-pixel feature descriptors.
Our method achieves superior performance on the SPair-71k real image benchmark.
arXiv Detail & Related papers (2023-05-23T17:58:05Z)
- Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study of detecting deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
- Semantic Prompt for Few-Shot Image Recognition [76.68959583129335]
We propose a novel Semantic Prompt (SP) approach for few-shot learning.
The proposed approach achieves promising results, improving the 1-shot learning accuracy by 3.67% on average.
arXiv Detail & Related papers (2023-03-24T16:32:19Z)
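The Diffusion Hyperfeatures entry above consolidates multi-scale, multi-timestep diffusion feature maps into per-pixel descriptors. A minimal sketch of such an aggregation is below, assuming the maps have already been resized to a common resolution; the function name, the fixed (rather than learned) mixing weights, and all shapes are illustrative assumptions.

```python
import numpy as np

def aggregate_hyperfeatures(feature_maps, weights):
    """Collapse a stack of feature maps from different diffusion
    timesteps/layers into one per-pixel descriptor via a weighted sum.

    feature_maps: (T, H, W, C) array, one map per timestep/layer,
                  assumed already resized to a common H x W x C.
    weights:      (T,) mixing weights (learned in the paper; fixed here).
    Returns an (H, W, C) array of per-pixel descriptors.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the descriptor scale stays stable
    # Weighted sum over the timestep/layer axis T -> (H, W, C).
    return np.tensordot(w, feature_maps, axes=(0, 0))
```

A learned version would make `weights` trainable parameters optimized for the downstream correspondence task; the tensor contraction itself is unchanged.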
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.