Swapping Autoencoder for Deep Image Manipulation
- URL: http://arxiv.org/abs/2007.00653v2
- Date: Mon, 14 Dec 2020 09:41:33 GMT
- Title: Swapping Autoencoder for Deep Image Manipulation
- Authors: Taesung Park, Jun-Yan Zhu, Oliver Wang, Jingwan Lu, Eli Shechtman,
Alexei A. Efros, Richard Zhang
- Abstract summary: We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation.
The key idea is to encode an image with two independent components and enforce that any swapped combination maps to a realistic image.
Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.
- Score: 94.33114146172606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep generative models have become increasingly effective at producing
realistic images from randomly sampled seeds, but using such models for
controllable manipulation of existing images remains challenging. We propose
the Swapping Autoencoder, a deep model designed specifically for image
manipulation, rather than random sampling. The key idea is to encode an image
with two independent components and enforce that any swapped combination maps
to a realistic image. In particular, we encourage the components to represent
structure and texture, by enforcing one component to encode co-occurrent patch
statistics across different parts of an image. As our method is trained with an
encoder, finding the latent codes for a new input image becomes trivial, rather
than cumbersome. As a result, it can be used to manipulate real input images in
various ways, including texture swapping, local and global editing, and latent
code vector arithmetic. Experiments on multiple datasets show that our model
produces better results and is substantially more efficient compared to recent
generative models.
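To make the encode/swap/decode idea concrete, here is a minimal PyTorch sketch of the structure/texture split described in the abstract. Every module name, shape, and layer choice below (SwappingAutoencoder, the 8-channel structure code, the pooled texture vector, and so on) is an illustrative assumption, not the authors' architecture, which is a much larger GAN-based model.

```python
# Minimal sketch of the swapping idea from the abstract. All shapes,
# names, and hyperparameters are illustrative assumptions, not the
# paper's actual implementation.
import torch
import torch.nn as nn

class SwappingAutoencoder(nn.Module):
    def __init__(self, ch=64, texture_dim=256):
        super().__init__()
        # Shared encoder trunk.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        # Structure code: a spatial tensor that keeps layout information.
        self.to_structure = nn.Conv2d(ch * 2, 8, 1)
        # Texture code: a pooled global vector, meant to capture
        # co-occurrent patch statistics such as color and texture.
        self.to_texture = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch * 2, texture_dim),
        )
        # Decoder maps (structure, texture) back to an image; texture is
        # broadcast and concatenated onto the structure tensor for simplicity.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8 + texture_dim, ch, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose2d(ch, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def encode(self, img):
        h = self.trunk(img)
        return self.to_structure(h), self.to_texture(h)

    def decode(self, structure, texture):
        b, _, h, w = structure.shape
        t = texture[:, :, None, None].expand(b, -1, h, w)
        return self.decoder(torch.cat([structure, t], dim=1))

# Swapping: structure from image A, texture from image B. A GAN
# discriminator (plus a patch co-occurrence discriminator comparing
# crops of the output against crops of B) would enforce that the
# hybrid is realistic, as the abstract describes.
model = SwappingAutoencoder()
img_a, img_b = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)
s_a, _ = model.encode(img_a)
_, t_b = model.encode(img_b)
hybrid = model.decode(s_a, t_b)  # layout of A, appearance of B
```

The latent code vector arithmetic mentioned in the abstract then reduces to adding or interpolating offsets on t_b (or s_a) before decoding.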
Related papers
- Closed-Loop Transcription via Convolutional Sparse Coding [29.75613581643052]
Autoencoders often use generic deep networks as the encoder or decoder, which are difficult to interpret.
In this work, we make the explicit assumption that the image distribution is generated from a multistage convolutional sparse coding (CSC) model.
Our method enjoys several side benefits, including more structured and interpretable representations, more stable convergence, and scalability to large datasets.
arXiv Detail & Related papers (2023-02-18T14:40:07Z)
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust cropping algorithms that reflect user intent.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
- SISL: Self-Supervised Image Signature Learning for Splicing Detection and Localization [11.437760125881049]
We propose a self-supervised approach for training splicing detection/localization models from frequency transforms of images.
Our proposed model can yield similar or better performance on standard datasets without relying on labels or metadata.
arXiv Detail & Related papers (2022-03-15T12:26:29Z)
- EdiBERT, a generative model for image editing [12.605607949417033]
EdiBERT is a bi-directional transformer trained in the discrete latent space built by a vector-quantized auto-encoder.
We show that the resulting model matches state-of-the-art performance on a wide variety of tasks.
arXiv Detail & Related papers (2021-11-30T10:23:06Z)
- StyleMapGAN: Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing [19.495153059077367]
Generative adversarial networks (GANs) synthesize realistic images from random latent vectors.
Editing real images with GANs suffers from i) time-consuming optimization for projecting real images to the latent vectors, or ii) inaccurate embedding through an encoder.
We propose StyleMapGAN: the intermediate latent space has spatial dimensions, and a spatially variant modulation replaces AdaIN.
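As a rough, hedged sketch of what a spatially variant modulation can look like (the module name SpatialModulation and all shapes here are assumptions, not the paper's code): a stylemap with spatial dimensions is resized to each feature resolution, so every location gets its own per-channel scale and shift rather than AdaIN's single global pair.

```python
# Illustrative sketch of spatially variant modulation; assumption-level
# code, not the StyleMapGAN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialModulation(nn.Module):
    def __init__(self, feat_ch, style_ch):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_ch, affine=False)
        self.to_gamma = nn.Conv2d(style_ch, feat_ch, 1)
        self.to_beta = nn.Conv2d(style_ch, feat_ch, 1)

    def forward(self, feat, stylemap):
        # Resize the stylemap to the feature map's spatial size, then
        # predict a per-pixel, per-channel scale and shift.
        s = F.interpolate(stylemap, size=feat.shape[-2:], mode='bilinear',
                          align_corners=False)
        return self.norm(feat) * (1 + self.to_gamma(s)) + self.to_beta(s)

mod = SpatialModulation(feat_ch=64, style_ch=32)
feat = torch.randn(1, 64, 32, 32)
stylemap = torch.randn(1, 32, 8, 8)   # latent with spatial dimensions
out = mod(feat, stylemap)
```

Because the latent itself is spatial, editing a region of the stylemap edits the corresponding region of the image, which is what enables local editing without per-image optimization.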
arXiv Detail & Related papers (2021-04-30T04:43:24Z)
- Ensembling with Deep Generative Views [72.70801582346344]
Generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose.
Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification.
We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars.
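A hedged sketch of the ensembling recipe, under stated assumptions: given a generator and a classifier (stubbed with toy stand-ins below; the paper uses StyleGAN2, and obtaining w for a real image requires GAN inversion), perturb the latent, regenerate each view, and average the predictions.

```python
# Sketch of ensembling over generative "views"; the generator/classifier
# stand-ins are toys so the example runs, not the paper's models.
import torch

def ensemble_predict(generator, classifier, w, n_views=8, sigma=0.1):
    views = [generator(w)]  # the reconstruction itself
    for _ in range(n_views - 1):
        # Each perturbed latent yields a slightly different "view".
        views.append(generator(w + sigma * torch.randn_like(w)))
    probs = torch.stack([classifier(v).softmax(dim=-1) for v in views])
    return probs.mean(dim=0)  # prediction averaged over all views

# Toy stand-ins so the sketch runs end to end:
generator = lambda w: torch.tanh(w.view(1, 3, 4, 4))      # fake 4x4 "image"
classifier = lambda img: img.flatten(1) @ torch.ones(48, 10)
w = torch.randn(1, 48)
p = ensemble_predict(generator, classifier, w)             # (1, 10) probs
```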
arXiv Detail & Related papers (2021-04-29T17:58:35Z)
- Free-Form Image Inpainting via Contrastive Attention Network [64.05544199212831]
In image inpainting tasks, masks of arbitrary shape can appear anywhere in an image, forming complex patterns.
It is difficult for encoders to learn powerful representations under such complex conditions.
We propose a self-supervised Siamese inference network to improve the robustness and generalization.
arXiv Detail & Related papers (2020-10-29T14:46:05Z)
- Intrinsic Autoencoders for Joint Neural Rendering and Intrinsic Image Decomposition [67.9464567157846]
We propose an autoencoder for joint generation of realistic images from synthetic 3D models while simultaneously decomposing real images into their intrinsic shape and appearance properties.
Our experiments confirm that a joint treatment of rendering and decomposition is indeed beneficial and that our approach outperforms state-of-the-art image-to-image translation baselines both qualitatively and quantitatively.
arXiv Detail & Related papers (2020-06-29T12:53:58Z)
- Locally Masked Convolution for Autoregressive Models [107.4635841204146]
LMConv is a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image.
We learn an ensemble of distribution estimators that share parameters but differ in generation order, achieving improved performance on whole-image density estimation.
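A minimal sketch of the mechanism, assuming the mask is applied to the unfolded input patches, which is mathematically equivalent to masking the shared kernel weights independently at each output location; this is illustrative, not the paper's implementation.

```python
# Locally masked 2D convolution: the same kernel weights are used
# everywhere, but an arbitrary binary mask zeroes out kernel taps
# independently at each output location. Assumption-level sketch.
import torch
import torch.nn.functional as F

def locally_masked_conv2d(x, weight, mask):
    # x:      (B, C_in, H, W)
    # weight: (C_out, C_in, k, k)
    # mask:   (B, C_in * k * k, H * W) -- one mask column per location
    c_out, c_in, k, _ = weight.shape
    b, _, h, w = x.shape
    patches = F.unfold(x, k, padding=k // 2)   # (B, C_in*k*k, H*W)
    patches = patches * mask                   # per-location masking
    out = weight.view(c_out, -1) @ patches     # (B, C_out, H*W)
    return out.view(b, c_out, h, w)

x = torch.randn(2, 3, 8, 8)
weight = torch.randn(16, 3, 3, 3)
mask = (torch.rand(2, 3 * 9, 64) > 0.5).float()  # arbitrary per-pixel masks
y = locally_masked_conv2d(x, weight, mask)       # (2, 16, 8, 8)
```

Choosing a different mask per image realizes a different generation order, which is what lets the paper ensemble estimators that share parameters but differ in ordering.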
arXiv Detail & Related papers (2020-06-22T17:59:07Z)