Cross-Image Attention for Zero-Shot Appearance Transfer
- URL: http://arxiv.org/abs/2311.03335v1
- Date: Mon, 6 Nov 2023 18:33:24 GMT
- Title: Cross-Image Attention for Zero-Shot Appearance Transfer
- Authors: Yuval Alaluf, Daniel Garibi, Or Patashnik, Hadar Averbuch-Elor, Daniel Cohen-Or
- Abstract summary: We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
- Score: 68.43651329067393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in text-to-image generative models have demonstrated a remarkable ability to capture a deep semantic understanding of images. In this work, we leverage this semantic knowledge to transfer the visual appearance between objects that share similar semantics but may differ significantly in shape. To achieve this, we build upon the self-attention layers of these generative models and introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images. Specifically, given a pair of images -- one depicting the target structure and the other specifying the desired appearance -- our cross-image attention combines the queries corresponding to the structure image with the keys and values of the appearance image. This operation, when applied during the denoising process, leverages the established semantic correspondences to generate an image combining the desired structure and appearance. In addition, to improve the output image quality, we harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process. Importantly, our approach is zero-shot, requiring no optimization or training. Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint between the two input images.
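At its core, the operation reuses a denoising U-Net's existing self-attention layers but swaps the source of the keys and values. A minimal PyTorch sketch of that single step, assuming access to a layer's hidden states for both images and its `to_q`/`to_k`/`to_v` projections (the function and argument names are illustrative, not taken from the authors' code):

```python
import torch
import torch.nn.functional as F

def cross_image_attention(x_struct, x_app, to_q, to_k, to_v, num_heads=8):
    """Queries from the structure image attend to keys/values from the
    appearance image at one self-attention layer of the denoising U-Net.

    x_struct, x_app: (batch, tokens, dim) hidden states for the structure
    and appearance images at the same layer and denoising step.
    to_q, to_k, to_v: the layer's existing linear projections, reused as-is.
    """
    b, n, d = x_struct.shape
    h = num_heads

    # Q comes from the structure image; K and V come from the appearance image.
    q = to_q(x_struct).reshape(b, n, h, d // h).transpose(1, 2)  # (b, h, n, d/h)
    k = to_k(x_app).reshape(b, n, h, d // h).transpose(1, 2)
    v = to_v(x_app).reshape(b, n, h, d // h).transpose(1, 2)

    # Standard scaled dot-product attention; pairing Q and K across images is
    # what implicitly matches semantically corresponding regions.
    out = F.scaled_dot_product_attention(q, k, v)

    # The layer's usual output projection would be applied by the caller.
    return out.transpose(1, 2).reshape(b, n, d)
```

In the full method this computation stands in for ordinary self-attention at selected layers and timesteps of the generated image's denoising pass; the three quality-improving mechanisms mentioned in the abstract are separate and not shown here.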
Related papers
- Disentangling Structure and Appearance in ViT Feature Space [26.233355454282446]
We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
We propose two frameworks for semantic appearance transfer: "Splice", which trains a generator on a single, arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward, real-time appearance transfer model trained on a dataset of images from a specific domain.
arXiv Detail & Related papers (2023-11-20T21:20:15Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Existing approaches suffer from semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- Masked Image Modeling with Denoising Contrast [30.31920660487222]
Masked image modeling dominates self-supervised visual pre-training, achieving state-of-the-art performance with vision Transformers.
We introduce a new pre-training method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the learning objective for masked patch prediction.
ConMIM-pretrained vision Transformers with various scales achieve promising results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks.
arXiv Detail & Related papers (2022-05-19T15:22:29Z)
- Splicing ViT Features for Semantic Appearance Transfer [10.295754142142686]
We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
arXiv Detail & Related papers (2022-01-02T22:00:34Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component in encouraging the convolutional features to find correspondences between similar objects (a minimal sketch of this objective appears after this list).
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models that outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network (a sketch of the composition step appears after this list).
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
- Co-Attention for Conditioned Image Matching [91.43244337264454]
We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material.
While other approaches find correspondences between pairs of images by treating the images independently, we instead condition on both images to implicitly take account of the differences between them.
arXiv Detail & Related papers (2020-07-16T17:32:00Z)
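The image-level contrastive learning highlighted in "Learning Contrastive Representation for Semantic Correspondence" above is typically an InfoNCE-style objective: two augmented views of the same image are positives, and every other image in the batch serves as a negative. A minimal PyTorch sketch under that assumption; the function name and view-pairing scheme are illustrative, not taken from that paper:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Image-level InfoNCE loss.

    z1, z2: (batch, dim) embeddings of two augmented views of the same
    batch of images; matching rows are positive pairs.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)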
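The hypercolumn composition described in "Learning to Compose Hypercolumns for Visual Correspondence" above amounts to resizing a chosen subset of CNN layer outputs to a common resolution and stacking them channel-wise. A minimal PyTorch sketch of that composition step; the learned layer-selection policy is that paper's contribution and is not modeled here, so `layer_idx` stands in for its output:

```python
import torch
import torch.nn.functional as F

def compose_hypercolumn(feature_maps, layer_idx, size):
    """Stack a selected subset of CNN feature maps into a hypercolumn.

    feature_maps: list of (batch, C_i, H_i, W_i) tensors, one per layer.
    layer_idx: indices of the layers selected as relevant for this image pair.
    size: (H, W) common spatial resolution for the composed features.
    """
    # Resize each selected map to the shared resolution, then concatenate
    # along channels so every spatial location gets a multi-layer descriptor.
    cols = [
        F.interpolate(feature_maps[i], size=size, mode="bilinear", align_corners=False)
        for i in layer_idx
    ]
    return torch.cat(cols, dim=1)
```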