Disentangling Structure and Appearance in ViT Feature Space
- URL: http://arxiv.org/abs/2311.12193v1
- Date: Mon, 20 Nov 2023 21:20:15 GMT
- Title: Disentangling Structure and Appearance in ViT Feature Space
- Authors: Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel
- Abstract summary: We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
We propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain.
- Score: 26.233355454282446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for semantically transferring the visual appearance of
one natural image to another. Specifically, our goal is to generate an image in
which objects in a source structure image are "painted" with the visual
appearance of their semantically related objects in a target appearance image.
To integrate semantic information into our framework, our key idea is to
leverage a pre-trained and fixed Vision Transformer (ViT) model. Specifically,
we derive novel disentangled representations of structure and appearance
extracted from deep ViT features. We then establish an objective function that
splices the desired structure and appearance representations, interweaving them
together in the space of ViT features. Based on our objective function, we
propose two frameworks of semantic appearance transfer -- "Splice", which works
by training a generator on a single and arbitrary pair of structure-appearance
images, and "SpliceNet", a feed-forward real-time appearance transfer model
trained on a dataset of images from a specific domain. Our frameworks do not
involve adversarial training, nor do they require any additional input
information such as semantic segmentation or correspondences. We demonstrate
high-resolution results on a variety of in-the-wild image pairs, under
significant variations in the number of objects, pose, and appearance. Code and
supplementary material are available on our project page: splice-vit.github.io.
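The abstract leaves the exact representations implicit. Below is a minimal sketch of one way such a spliced objective can be written, assuming appearance is captured by the deepest ViT layer's [CLS] token and structure by the self-similarity of that layer's keys; the feature-dictionary layout and the (unshown) DINO-ViT extractor are illustrative assumptions, not the released code.

```python
import torch.nn.functional as F

def key_self_similarity(keys):
    # keys: (num_tokens, dim) self-attention keys from the deepest layer of a
    # frozen pre-trained ViT. The cosine-similarity matrix between spatial
    # tokens serves as the structure representation.
    k = F.normalize(keys, dim=-1)
    return k @ k.T

def splice_losses(gen_feats, structure_feats, appearance_feats):
    # Each argument is assumed to be a dict with "keys" (num_tokens, dim) and
    # "cls" (dim,) extracted from the frozen ViT; the extractor is not shown.
    loss_structure = F.mse_loss(
        key_self_similarity(gen_feats["keys"]),
        key_self_similarity(structure_feats["keys"]),
    )
    loss_appearance = F.mse_loss(gen_feats["cls"], appearance_feats["cls"])
    return loss_structure, loss_appearance
```

Under these assumptions, the generator in "Splice" (or the feed-forward "SpliceNet") would be optimized so that its output reproduces the structure image's key self-similarity while matching the appearance image's [CLS] token.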
Related papers
- GroundingBooth: Grounding Text-to-Image Customization [17.185571339157075]
We introduce GroundingBooth, a framework that achieves zero-shot instance-level spatial grounding on both foreground subjects and background objects.
Our proposed text-image grounding module and masked cross-attention layer allow us to generate personalized images with both accurate layout alignment and identity preservation.
arXiv Detail & Related papers (2024-09-13T03:40:58Z)
- Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP [53.18562650350898]
We introduce a general framework which can identify the roles of various components in ViTs beyond CLIP.
We also introduce a novel scoring function to rank components by their importance with respect to specific features.
Applying our framework to various ViT variants, we gain insights into the roles of different components concerning particular image features.
arXiv Detail & Related papers (2024-06-03T17:58:43Z)
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
arXiv Detail & Related papers (2023-11-06T18:33:24Z)
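The cross-image attention in the entry above implicitly matches tokens across the two images inside the denoising network's attention layers. One way to realize such a mechanism is to take queries from the structure image's path and keys/values from the appearance image's path; the sketch below illustrates that swap (the function name and tensor shapes are assumptions, not the paper's code).

```python
import torch  # inputs are assumed to be torch tensors

def cross_image_attention(q_struct, k_app, v_app):
    # Scaled dot-product attention in which queries come from the structure
    # image's denoising path while keys/values come from the appearance
    # image's path, so each structure token pulls appearance from its most
    # semantically similar appearance tokens.
    # q_struct: (n_tokens, dim); k_app, v_app: (m_tokens, dim) -- illustrative.
    scale = q_struct.shape[-1] ** -0.5
    attn = (q_struct @ k_app.T * scale).softmax(dim=-1)
    return attn @ v_app
```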
- Neural Congealing: Aligning Images to a Joint Semantic Atlas [14.348512536556413]
We present a zero-shot self-supervised framework for aligning semantically-common content across a set of images.
Our approach harnesses the power of pre-trained DINO-ViT features to learn a joint semantic atlas to which the input images are aligned.
We show that our method performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.
arXiv Detail & Related papers (2023-02-08T09:26:22Z)
- Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT (SViT) shows how utilizing the structure of a small number of images, available only during training, can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z)
- Splicing ViT Features for Semantic Appearance Transfer [10.295754142142686]
We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
arXiv Detail & Related papers (2022-01-02T22:00:34Z)
- SAC-GAN: Structure-Aware Image-to-Image Composition for Self-Driving [18.842432515507035]
We present a compositional approach to image augmentation for self-driving applications.
It is an end-to-end neural network trained to seamlessly compose an object, represented as a cropped patch from an object image, into a background scene image.
We evaluate our network, coined SAC-GAN for structure-aware composition, on prominent self-driving datasets in terms of quality, composability, and generalizability of the composite images.
arXiv Detail & Related papers (2021-12-13T12:24:50Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill high-quality, semantically consistent representations that capture the intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images.
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
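The summary above does not define SceneFID precisely; a plausible reading is that the standard Fréchet Inception Distance is computed over Inception features of per-object crops (e.g. cut out with the layout's bounding boxes) rather than whole images. The sketch below illustrates that interpretation; the helper names and the crop-feature inputs are assumptions, not the paper's released metric.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, cov1, mu2, cov2):
    # Frechet distance between Gaussians fitted to two feature sets:
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^(1/2))
    covmean = linalg.sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts introduced by sqrtm
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

def object_centric_fid(real_crop_feats, gen_crop_feats):
    # Inputs: (num_crops, feat_dim) arrays of Inception activations computed on
    # per-object crops instead of full images -- the assumed "object-centric"
    # twist on plain FID.
    mu_r, cov_r = real_crop_feats.mean(0), np.cov(real_crop_feats, rowvar=False)
    mu_g, cov_g = gen_crop_feats.mean(0), np.cov(gen_crop_feats, rowvar=False)
    return frechet_distance(mu_r, cov_r, mu_g, cov_g)
```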
This list is automatically generated from the titles and abstracts of the papers in this site.