Learning Flow Fields in Attention for Controllable Person Image Generation
- URL: http://arxiv.org/abs/2412.08486v2
- Date: Thu, 12 Dec 2024 18:43:39 GMT
- Title: Learning Flow Fields in Attention for Controllable Person Image Generation
- Authors: Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He
- Abstract summary: Controllable person image generation aims to generate a person image conditioned on reference images.
We propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key.
Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion.
- Score: 59.10843756343987
- License:
- Abstract: Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
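The abstract describes the regularization only at a high level. As a rough illustration, the minimal PyTorch sketch below shows one plausible way to turn an attention map into a flow field and regularize it: a soft-argmax over each target query's attention distribution across reference keys yields an expected reference coordinate, which is penalized against a supervision flow. The tensor shapes, the square-grid assumption, and the availability of a supervision flow are all assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_flow_loss(attn: torch.Tensor, target_flow: torch.Tensor) -> torch.Tensor:
    """Regularize an attention map via the flow field it induces.

    attn:        (B, Nq, Nk) attention probabilities from target queries
                 to reference keys (each row sums to 1).
    target_flow: (B, Nq, 2) supervision flow in normalized [-1, 1]
                 reference coordinates (assumed available, e.g. from a
                 known warp between reference and target).
    """
    B, Nq, Nk = attn.shape
    Hk = Wk = int(Nk ** 0.5)  # assume a square reference feature grid

    # Normalized (x, y) coordinates of every reference key position.
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, Hk),
        torch.linspace(-1.0, 1.0, Wk),
        indexing="ij",
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(1, Nk, 2).to(attn)

    # Soft-argmax: the expected reference coordinate under each query's
    # attention distribution gives a dense flow field.
    flow = attn @ coords  # (B, Nq, 2)

    # Penalize queries whose attention mass sits on the wrong reference regions.
    return F.mse_loss(flow, target_flow)
```

In training, such a term would be added to the diffusion objective with a weighting coefficient; since the paper reports the loss is model-agnostic, it could in principle be attached to any attention layer that cross-attends to the reference.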
Related papers
- Masked Extended Attention for Zero-Shot Virtual Try-On In The Wild [17.025262797698364]
Virtual Try-On aims to replace a garment in an image with one from another image, while preserving person and garment characteristics as well as image fidelity.
Current literature takes a supervised approach to the task, which impairs generalization and imposes heavy computational cost.
We present a novel zero-shot, training-free method for inpainting a clothing garment by reference.
arXiv Detail & Related papers (2024-06-21T17:45:37Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
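The DiffUHaul summary above mentions blending attention features between source and target during the early denoising steps. A minimal sketch of such an interpolation schedule follows; the linear ramp and all names are assumptions for illustration, not the paper's exact schedule.

```python
import torch

def blend_attention_features(
    feat_src: torch.Tensor,   # attention features from the source image
    feat_tgt: torch.Tensor,   # attention features for the target layout
    step: int,                # current denoising step (0-indexed)
    n_early_steps: int,       # how many early steps blend the two
) -> torch.Tensor:
    # Early steps mix source appearance into the new layout so it fuses
    # smoothly; once past the early phase, target features are used as-is.
    if step >= n_early_steps:
        return feat_tgt
    alpha = step / n_early_steps  # 0.0 = all source, approaching 1.0 = all target
    return torch.lerp(feat_src, feat_tgt, alpha)
```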
- Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks [64.67735676127208]
Text-to-image diffusion models have shown great potential for benefiting image recognition.
Although promising, unsupervised learning on diffusion-generated images remains inadequately explored.
We introduce customized solutions that fully exploit the attention masks that diffusion models produce for free.
arXiv Detail & Related papers (2023-08-13T10:07:46Z)
- Masked Image Training for Generalizable Deep Image Denoising [53.03126421917465]
We present a novel approach to enhance the generalization performance of denoising networks.
Our method involves masking random pixels of the input image and reconstructing the missing information during training.
Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios.
arXiv Detail & Related papers (2023-03-23T09:33:44Z)
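The masking scheme in the Masked Image Training summary above (mask random pixels, reconstruct them during training) can be illustrated with a minimal sketch like the one below; the mask ratio and zero-filling strategy are assumptions, not the paper's exact recipe.

```python
import torch

def mask_random_pixels(img: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly zero out pixels of a (B, C, H, W) image batch.

    The denoiser would then be trained to reconstruct the hidden
    pixels, which the paper reports improves generalization.
    """
    # One boolean keep-mask per image, shared across channels.
    keep = torch.rand(img.shape[0], 1, *img.shape[-2:], device=img.device) > mask_ratio
    return img * keep, keep
```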
- LTT-GAN: Looking Through Turbulence by Inverting GANs [86.25869403782957]
We propose the first turbulence mitigation method that makes use of visual priors encapsulated by a well-trained GAN.
Based on these visual priors, we propose to learn to preserve the identity of restored images via a periodic contextual distance.
Our method significantly outperforms prior art in both the visual quality and face verification accuracy of restored results.
arXiv Detail & Related papers (2021-12-04T16:42:13Z)
- Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network [92.01145655155374]
We present an unsupervised image enhancement generative adversarial network (UEGAN).
It learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner.
Results show that the proposed model effectively improves the aesthetic quality of images.
arXiv Detail & Related papers (2020-12-30T03:22:46Z)
- Learning Edge-Preserved Image Stitching from Large-Baseline Deep Homography [32.28310831466225]
We propose an image stitching learning framework, which consists of a large-baseline deep homography module and an edge-preserved deformation module.
Our method is superior to the existing learning-based method and shows competitive performance with state-of-the-art traditional methods.
arXiv Detail & Related papers (2020-12-11T08:43:30Z)