InstructRL4Pix: Training Diffusion for Image Editing by Reinforcement Learning
- URL: http://arxiv.org/abs/2406.09973v1
- Date: Fri, 14 Jun 2024 12:31:48 GMT
- Title: InstructRL4Pix: Training Diffusion for Image Editing by Reinforcement Learning
- Authors: Tiancheng Li, Jinxiu Liu, Huajun Chen, Qi Liu,
- Abstract summary: We propose Reinforcement Learning Guided Image Editing Method(InstructRL4Pix) to train a diffusion model to generate images that are guided by the attention maps of the target object.
Experimental results show that InstructRL4Pix breaks through the limitations of traditional datasets and uses unsupervised learning to optimize editing goals and achieve accurate image editing based on natural human commands.
- Score: 31.799923647356458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based image editing has made a great process in using natural human language to manipulate the visual content of images. However, existing models are limited by the quality of the dataset and cannot accurately localize editing regions in images with complex object relationships. In this paper, we propose Reinforcement Learning Guided Image Editing Method(InstructRL4Pix) to train a diffusion model to generate images that are guided by the attention maps of the target object. Our method maximizes the output of the reward model by calculating the distance between attention maps as a reward function and fine-tuning the diffusion model using proximal policy optimization (PPO). We evaluate our model in object insertion, removal, replacement, and transformation. Experimental results show that InstructRL4Pix breaks through the limitations of traditional datasets and uses unsupervised learning to optimize editing goals and achieve accurate image editing based on natural human commands.
Related papers
- PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference [62.72779589895124]
We make the first attempt to align diffusion models for image inpainting with human aesthetic standards via a reinforcement learning framework.
We train a reward model with a dataset we construct, consisting of nearly 51,000 images annotated with human preferences.
Experiments on inpainting comparison and downstream tasks, such as image extension and 3D reconstruction, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-29T11:49:39Z) - Image Inpainting Models are Effective Tools for Instruction-guided Image Editing [42.63350374074953]
This technique report is for the winning solution of the CVPR2024 GenAI Media Generation Challenge Workshop's Instruction-guided Image Editing track.
We use a 4-step process IIIE (Inpainting-based Instruction-guided Image Editing): editing category classification, main editing object identification, editing mask acquisition, and image inpainting.
Results show that through proper combinations of language models and image inpainting models, our pipeline can reach a high success rate with satisfying visual quality.
arXiv Detail & Related papers (2024-07-18T03:55:33Z) - DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z) - Diffusion Model-Based Image Editing: A Survey [46.244266782108234]
Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks.
We provide an exhaustive overview of existing methods using diffusion models for image editing.
To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval.
arXiv Detail & Related papers (2024-02-27T14:07:09Z) - VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [108.60416277357712]
In this work, we introduce a framework that is object-centric and is designed to control both the object's appearance and, notably, to execute precise and explicit structural modifications on the object.
We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control.
We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities.
arXiv Detail & Related papers (2024-01-04T18:59:24Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity, CLIP alignment score and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models [4.855820180160146]
We propose a novel diffusion-based image editing framework with pixel-wise guidance.
We demonstrate that our proposal outperforms the GAN-based method for editing quality and speed.
arXiv Detail & Related papers (2022-12-05T04:39:08Z) - InstructPix2Pix: Learning to Follow Image Editing Instructions [103.77092910685764]
We propose a method for editing images from human instructions.
given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image.
We show compelling editing results for a diverse collection of input images and written instructions.
arXiv Detail & Related papers (2022-11-17T18:58:43Z) - Look here! A parametric learning based approach to redirect visual
attention [49.609412873346386]
We introduce an automatic method to make an image region more attention-capturing via subtle image edits.
Our model predicts a distinct set of global parametric transformations to be applied to the foreground and background image regions.
Our edits enable inference at interactive rates on any image size, and easily generalize to videos.
arXiv Detail & Related papers (2020-08-12T16:08:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.