Multi-Reward as Condition for Instruction-based Image Editing
- URL: http://arxiv.org/abs/2411.04713v1
- Date: Wed, 06 Nov 2024 05:02:29 GMT
- Title: Multi-Reward as Condition for Instruction-based Image Editing
- Authors: Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu
- Abstract summary: We propose to address the training data quality issue with multi-perspective reward data instead of refining the ground-truth image quality.
Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines.
- Score: 32.77114231615961
- License:
- Abstract: High-quality training triplets (instruction, original image, edited image) are essential for instruction-based image editing. Predominant training datasets (e.g., InsPix2Pix) are created using text-to-image generative models (e.g., Stable Diffusion, DALL-E) which are not trained for image editing. Accordingly, these datasets suffer from inaccurate instruction following, poor detail preservation, and generation artifacts. In this paper, we propose to address the training data quality issue with multi-perspective reward data instead of refining the ground-truth image quality. 1) We first design a quantitative metric system based on a best-in-class LVLM (Large Vision Language Model), i.e., GPT-4o in our case, to evaluate the generation quality from 3 perspectives, namely, instruction following, detail preserving, and generation quality. For each perspective, we collect a quantitative score in $0\sim 5$ and text descriptive feedback on the specific failure points in ground-truth edited images, resulting in a high-quality editing reward dataset, i.e., RewardEdit20K. 2) We further propose a novel training framework to seamlessly integrate the metric output, regarded as multi-reward, into editing models to learn from the imperfect training triplets. During training, the reward scores and text descriptions are encoded as embeddings and fed into both the latent space and the U-Net of the editing models as auxiliary conditions. During inference, we set these additional conditions to the highest score with no text description of failure points, aiming for the best generation outcome. Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines, i.e., InsPix2Pix and SmartEdit. The code and dataset will be released.
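To make step 1 concrete, below is a minimal sketch of how a GPT-4o-based metric system of this kind could be queried for per-perspective scores and failure-point feedback. The prompt wording, the JSON schema, and the helper names (`score_edit`, `encode_image`) are illustrative assumptions; the paper does not publish its exact prompts.

```python
# A minimal sketch of the multi-perspective scoring step, assuming access to
# GPT-4o via the OpenAI API. The prompt and output schema are guesses, not the
# paper's released prompts.
import base64
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline to the API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def score_edit(instruction: str, original_path: str, edited_path: str) -> dict:
    """Ask the LVLM for a 0-5 score plus textual feedback per perspective."""
    prompt = (
        f"The second image was edited from the first following the instruction: "
        f"'{instruction}'.\n"
        "For each of the three perspectives -- instruction following, detail "
        "preserving, and generation quality -- return an integer score from 0 "
        "to 5 and a short description of any failure points. Respond as JSON "
        'like {"instruction_following": {"score": 4, "feedback": "..."}, ...}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode_image(original_path)}"}},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encode_image(edited_path)}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```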
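Step 2 can likewise be sketched. The abstract states that reward scores and text descriptions are encoded as embeddings and injected into the latent space and the U-Net as auxiliary conditions; the PyTorch sketch below shows one plausible encoder for the U-Net side, where reward tokens are appended to the cross-attention context. All module names, dimensions, and the concatenation scheme are assumptions rather than the paper's released implementation.

```python
# A minimal PyTorch sketch of reward conditioning, under the assumption that
# per-perspective reward tokens are appended to the cross-attention context of
# the editing U-Net. Module names and dimensions are hypothetical.
import torch
import torch.nn as nn


class RewardEncoder(nn.Module):
    """Embed per-perspective reward scores (0-5) and pooled feedback-text
    embeddings into extra condition tokens for the denoising U-Net."""

    def __init__(self, num_perspectives: int = 3, text_dim: int = 768, cond_dim: int = 768):
        super().__init__()
        # One learned embedding per discrete score value (0..5).
        self.score_emb = nn.Embedding(6, cond_dim)
        # One learned embedding per perspective, so the model can tell them apart.
        self.perspective_emb = nn.Embedding(num_perspectives, cond_dim)
        # Project pooled feedback-text embeddings into the condition space.
        self.feedback_proj = nn.Linear(text_dim, cond_dim)

    def forward(self, scores: torch.LongTensor, feedback: torch.Tensor) -> torch.Tensor:
        # scores:   (batch, num_perspectives) integer scores in [0, 5]
        # feedback: (batch, num_perspectives, text_dim) pooled text embeddings
        ids = torch.arange(scores.shape[1], device=scores.device)
        tokens = self.score_emb(scores) + self.perspective_emb(ids) + self.feedback_proj(feedback)
        return tokens  # (batch, num_perspectives, cond_dim)


# During training, the reward tokens would be concatenated with the instruction
# tokens the U-Net cross-attends to; at inference, scores are fixed to 5 and
# the feedback embedding set to an "empty" (e.g., zero) vector, matching the
# abstract's highest-score, no-failure-text setting:
#   context = torch.cat([instruction_tokens, reward_tokens], dim=1)
```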
Related papers
- Learning Action and Reasoning-Centric Image Editing from Videos and Simulations [45.637947364341436]
The AURORA dataset is a collection of high-quality training data, human-annotated and curated from videos and simulation engines.
We evaluate an AURORA-finetuned model on a new expert-curated benchmark covering 8 diverse editing tasks.
Our model significantly outperforms previous editing models as judged by human raters.
arXiv Detail & Related papers (2024-07-03T19:36:33Z)
- ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing [77.12834553200632]
We introduce ReasonPix2Pix, a comprehensive reasoning-attentive instruction editing dataset.
The dataset is characterized by 1) reasoning instruction, 2) more realistic images from fine-grained categories, and 3) increased variances between input and edited images.
When fine-tuned with our dataset under supervised conditions, the model demonstrates superior performance in instructional editing tasks, independent of whether the tasks require reasoning or not.
arXiv Detail & Related papers (2024-05-18T06:03:42Z)
- Retrieval Enhanced Zero-Shot Video Captioning [69.96136689829778]
We bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2.
To connect these frozen models, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP.
Experiments show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing [38.13162627140172]
HQ-Edit is a high-quality instruction-based image editing dataset with around 200,000 edits.
To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs.
HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models.
arXiv Detail & Related papers (2024-04-15T17:59:31Z)
- Learning to Follow Object-Centric Image Editing Instructions Faithfully [26.69032113274608]
Current approaches focusing on image editing with natural language instructions rely on automatically generated paired data.
We significantly improve the quality of the paired data and enhance the supervision signal.
Our model is capable of performing fine-grained object-centric edits better than state-of-the-art baselines.
arXiv Detail & Related papers (2023-10-29T20:39:11Z)
- ImagenHub: Standardizing the evaluation of conditional image generation models [48.51117156168]
This paper proposes ImagenHub, a one-stop library that standardizes the inference and evaluation of all conditional image generation models.
We design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images.
Our human evaluation achieves high inter-worker agreement: Krippendorff's alpha exceeds 0.4 on 76% of the models.
arXiv Detail & Related papers (2023-10-02T19:41:42Z)
- MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing [48.204992417461575]
We introduce MagicBrush, the first large-scale, manually annotated dataset for instruction-guided real image editing.
We show that the new model can produce much better images according to human evaluation.
arXiv Detail & Related papers (2023-06-16T17:58:58Z)
- iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z)
- InstructPix2Pix: Learning to Follow Image Editing Instructions [103.77092910685764]
We propose a method for editing images from human instructions.
Given an input image and a written instruction that tells the model what to do, our model follows the instruction to edit the image.
We show compelling editing results for a diverse collection of input images and written instructions.
arXiv Detail & Related papers (2022-11-17T18:58:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.