Splicing ViT Features for Semantic Appearance Transfer
- URL: http://arxiv.org/abs/2201.00424v1
- Date: Sun, 2 Jan 2022 22:00:34 GMT
- Title: Splicing ViT Features for Semantic Appearance Transfer
- Authors: Narek Tumanyan, Omer Bar-Tal, Shai Bagon, Tali Dekel
- Abstract summary: We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
- Score: 10.295754142142686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a method for semantically transferring the visual appearance of
one natural image to another. Specifically, our goal is to generate an image in
which objects in a source structure image are "painted" with the visual
appearance of their semantically related objects in a target appearance image.
Our method works by training a generator given only a single
structure/appearance image pair as input. To integrate semantic information
into our framework - a pivotal component in tackling this task - our key idea
is to leverage a pre-trained and fixed Vision Transformer (ViT) model which
serves as an external semantic prior. Specifically, we derive novel
representations of structure and appearance extracted from deep ViT features,
untwisting them from the learned self-attention modules. We then establish an
objective function that splices the desired structure and appearance
representations, interweaving them together in the space of ViT features. Our
framework, which we term "Splice", does not involve adversarial training, nor
does it require any additional input information such as semantic segmentation
or correspondences, and can generate high-resolution results, e.g., work in HD.
We demonstrate high quality results on a variety of in-the-wild image pairs,
under significant variations in the number of objects, their pose and
appearance.
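The abstract leaves the structure and appearance representations implicit. Below is a minimal, hedged sketch of the idea, assuming (as in the paper's DINO-ViT setting) that structure is captured by the self-similarity of keys from the deepest attention block and appearance by the global [CLS] token; the torch.hub entry point `dino_vits16`, the hook on `blocks[-1].attn.qkv`, and the simple loss weighting are illustrative, not the authors' exact implementation.

```python
# Minimal sketch (not the authors' code) of the Splice objective in ViT feature space:
# a frozen DINO ViT-S/16 supplies a structure code (self-similarity of the deepest
# block's keys) and an appearance code (the global [CLS] token); a generator's output
# is pushed to match the structure of one input and the appearance of the other.
import torch
import torch.nn.functional as F

vit = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
for p in vit.parameters():          # external semantic prior: pre-trained and fixed
    p.requires_grad_(False)

_cache = {}

def _grab_keys(module, inputs, output):
    # Output of the qkv linear layer: (B, tokens, 3 * dim) -> keep the key chunk.
    B, N, _ = output.shape
    _cache["keys"] = output.reshape(B, N, 3, -1)[:, :, 1, :]

vit.blocks[-1].attn.qkv.register_forward_hook(_grab_keys)

def structure_code(img):
    """Self-similarity of deepest-layer keys: spatial layout and pose."""
    vit(img)                                         # img: ImageNet-normalized, H and W divisible by 16
    k = F.normalize(_cache["keys"][:, 1:], dim=-1)   # drop the [CLS] token (illustrative choice)
    return k @ k.transpose(1, 2)                     # (B, patches, patches)

def appearance_code(img):
    """Global [CLS] token of the last layer: appearance / style."""
    return vit(img)

def splice_loss(generated, structure_img, appearance_img, w_app=1.0):
    """Splice the structure of one input with the appearance of the other."""
    l_struct = F.mse_loss(structure_code(generated), structure_code(structure_img))
    l_app = F.mse_loss(appearance_code(generated), appearance_code(appearance_img))
    return l_struct + w_app * l_app
```

In this sketch, the generator would be optimized on the single structure/appearance pair by minimizing splice_loss on its output (all images at the same resolution); the paper's additional regularization terms and multi-resolution handling are omitted.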
Related papers
- Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP [53.18562650350898]
We introduce a general framework which can identify the roles of various components in ViTs beyond CLIP.
We also introduce a novel scoring function to rank components by their importance with respect to specific features.
Applying our framework to various ViT variants, we gain insights into the roles of different components concerning particular image features.
arXiv Detail & Related papers (2024-06-03T17:58:43Z)
- Disentangling Structure and Appearance in ViT Feature Space [26.233355454282446]
We present a method for semantically transferring the visual appearance of one natural image to another.
Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image.
We propose two frameworks of semantic appearance transfer -- "Splice", which works by training a generator on a single and arbitrary pair of structure-appearance images, and "SpliceNet", a feed-forward real-time appearance transfer model trained on a dataset of images from a specific domain.
arXiv Detail & Related papers (2023-11-20T21:20:15Z)
- Cross-Image Attention for Zero-Shot Appearance Transfer [68.43651329067393]
We introduce a cross-image attention mechanism that implicitly establishes semantic correspondences across images.
We harness three mechanisms that either manipulate the noisy latent codes or the model's internal representations throughout the denoising process.
Experiments show that our method is effective across a wide range of object categories and is robust to variations in shape, size, and viewpoint.
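As a hedged illustration of the cross-image attention mechanism summarized above (not the paper's implementation): queries come from the features of the image being generated, while keys and values come from the appearance image's features at the same denoising step, so structure tokens pull content from semantically similar appearance tokens. The single-head formulation and tensor shapes are simplifications.

```python
# Hedged sketch of one cross-image attention step inside a denoising pass.
import torch

def cross_image_attention(q_struct, k_app, v_app):
    """q_struct: (B, N, d) queries from the structure branch;
    k_app, v_app: (B, M, d) keys/values from the appearance branch."""
    scale = q_struct.shape[-1] ** -0.5
    attn = torch.softmax(q_struct @ k_app.transpose(1, 2) * scale, dim=-1)
    return attn @ v_app  # (B, N, d): appearance content arranged by structure
```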
arXiv Detail & Related papers (2023-11-06T18:33:24Z)
- Neural Congealing: Aligning Images to a Joint Semantic Atlas [14.348512536556413]
We present a zero-shot self-supervised framework for aligning semantically-common content across a set of images.
Our approach harnesses the power of pre-trained DINO-ViT features to learn a joint semantic atlas and dense mappings from each image to it.
We show that our method performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets.
arXiv Detail & Related papers (2023-02-08T09:26:22Z)
- SAC-GAN: Structure-Aware Image-to-Image Composition for Self-Driving [18.842432515507035]
We present a compositional approach to image augmentation for self-driving applications.
It is an end-to-end neural network trained to seamlessly compose an object, represented as a cropped patch from an object image, into a background scene image.
We evaluate our network, coined SAC-GAN for structure-aware composition, on prominent self-driving datasets in terms of quality, composability, and generalizability of the composite images.
arXiv Detail & Related papers (2021-12-13T12:24:50Z)
- Semantic-Aware Generation for Self-Supervised Visual Representation Learning [116.5814634936371]
We advocate for Semantic-aware Generation (SaGe) to facilitate richer semantics rather than details to be preserved in the generated image.
SaGe complements the target network with view-specific features and thus alleviates the semantic degradation brought by intensive data augmentations.
We execute SaGe on ImageNet-1K and evaluate the pre-trained models on five downstream tasks including nearest neighbor test, linear classification, and fine-scaled image recognition.
arXiv Detail & Related papers (2021-11-25T16:46:13Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of the visual story.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study these properties via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill high-quality, semantically consistent representations that capture the intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.