Learning Action and Reasoning-Centric Image Editing from Videos and Simulations
- URL: http://arxiv.org/abs/2407.03471v3
- Date: Thu, 17 Oct 2024 15:12:44 GMT
- Title: Learning Action and Reasoning-Centric Image Editing from Videos and Simulations
- Authors: Benno Krojer, Dheeraj Vattikonda, Luis Lara, Varun Jampani, Eva Portelance, Christopher Pal, Siva Reddy
- Abstract summary: The AURORA dataset is a collection of high-quality training data, human-annotated and curated from videos and simulation engines.
We evaluate an AURORA-finetuned model on a new expert-curated benchmark covering 8 diverse editing tasks.
Our model significantly outperforms previous editing models as judged by human raters.
- Score: 45.637947364341436
- Abstract: An image editing model should be able to perform diverse edits, ranging from object replacement, changing attributes or style, to performing actions or movement, which require many forms of reasoning. Current general instruction-guided editing models have significant shortcomings with action and reasoning-centric edits. Object, attribute or stylistic changes can be learned from visually static datasets. On the other hand, high-quality data for action and reasoning-centric edits is scarce and has to come from entirely different sources that cover, e.g., physical dynamics, temporality and spatial reasoning. To this end, we meticulously curate the AURORA Dataset (Action-Reasoning-Object-Attribute), a collection of high-quality training data, human-annotated and curated from videos and simulation engines. We focus on a key aspect of quality training data: triplets (source image, prompt, target image) contain a single meaningful visual change described by the prompt, i.e., truly minimal changes between source and target images. To demonstrate the value of our dataset, we evaluate an AURORA-finetuned model on a new expert-curated benchmark (AURORA-Bench) covering 8 diverse editing tasks. Our model significantly outperforms previous editing models as judged by human raters. For automatic evaluations, we find important flaws in previous metrics and caution against their use for semantically hard editing tasks. Instead, we propose a new automatic metric that focuses on discriminative understanding. We hope that our efforts: (1) curating a quality training dataset and an evaluation benchmark, (2) developing critical evaluations, and (3) releasing a state-of-the-art model, will fuel further progress on general image editing.
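To make the triplet format and the discriminative evaluation idea concrete, here is a minimal sketch; the field names and the (source, prompt, candidate) scoring interface are assumptions for illustration, not the released AURORA code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditTriplet:
    """One AURORA-style example: exactly one prompt-described visual change."""
    source_image: str   # path to the "before" image
    prompt: str         # instruction describing the single minimal edit
    target_image: str   # path to the "after" image

def discriminative_check(triplet: EditTriplet,
                         score: Callable[[str, str, str], float]) -> bool:
    """A scorer that understands the edit should rate the true target above
    the unedited source, given (source, prompt). The (source, prompt,
    candidate) -> float interface is an assumption for illustration."""
    true_target = score(triplet.source_image, triplet.prompt, triplet.target_image)
    no_edit = score(triplet.source_image, triplet.prompt, triplet.source_image)
    return true_target > no_edit
```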
Related papers
- PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM [17.89238060470998]
Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI.
Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement.
arXiv Detail & Related papers (2024-10-08T06:05:15Z)
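A rough sketch of how detection-based disentanglement evaluation can work: objects outside the edited one should survive the edit unchanged. The detector interface and the threshold are assumptions, not PixLens's actual protocol:

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def edit_locality(source: str, edited: str, target_label: str,
                  detect: Callable[[str], Dict[str, List[Box]]],
                  iou_thresh: float = 0.8) -> float:
    """Fraction of non-target objects whose detections survive the edit;
    an illustrative proxy for disentanglement, not the paper's protocol."""
    before, after = detect(source), detect(edited)
    kept = total = 0
    for label, boxes in before.items():
        if label == target_label:
            continue  # the one object the edit is allowed to change
        for box in boxes:
            total += 1
            if any(iou(box, b) >= iou_thresh for b in after.get(label, [])):
                kept += 1
    return kept / total if total else 1.0
```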
- Learning Feature-Preserving Portrait Editing from Generated Pairs [11.122956539965761]
We propose a training-based method leveraging auto-generated paired data to learn the desired edits.
Our method achieves state-of-the-art quality, quantitatively and qualitatively.
arXiv Detail & Related papers (2024-07-29T23:19:42Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image Comprehension (STIC), a self-training approach aimed specifically at image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
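The self-constructed preference step could look roughly like this sketch, where the prompts and image corruptions are illustrative stand-ins rather than STIC's exact recipe:

```python
import random
from typing import Callable, List, Tuple

def build_preference_pairs(
    images: List[str],
    describe: Callable[[str, str], str],   # (image, prompt) -> description
    corrupt: Callable[[str], str],         # image -> corrupted image
) -> List[Tuple[str, str, str]]:
    """Self-constructed (image, preferred, dispreferred) description pairs."""
    good_prompt = "Describe the image step by step, grounding every claim."
    bad_prompt = "Describe the image, confidently adding plausible details."
    pairs = []
    for img in images:
        preferred = describe(img, good_prompt)
        # Dispreferred responses come from a corrupted input or a misleading prompt.
        if random.random() < 0.5:
            dispreferred = describe(corrupt(img), good_prompt)
        else:
            dispreferred = describe(img, bad_prompt)
        pairs.append((img, preferred, dispreferred))
    return pairs
```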
- Learning to Follow Object-Centric Image Editing Instructions Faithfully [26.69032113274608]
Current approaches to image editing with natural language instructions rely on automatically generated paired data.
We significantly improve the quality of the paired data and enhance the supervision signal.
Our model is capable of performing fine-grained object-centric edits better than state-of-the-art baselines.
arXiv Detail & Related papers (2023-10-29T20:39:11Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Memory-Based Model Editing at Scale [102.28475739907498]
Existing model editors struggle to accurately model an edit's intended scope.
We propose Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model (SERAC).
SERAC stores edits in an explicit memory and learns to reason over them to modulate the base model's predictions as needed.
arXiv Detail & Related papers (2022-06-13T23:40:34Z)
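A minimal sketch of SERAC-style routing, with all model interfaces assumed for illustration: edits sit in an explicit memory, and a scope score decides whether the counterfactual model overrides the base model:

```python
from typing import Callable, List

class SeracStyleEditor:
    """Route queries between a frozen base model and a counterfactual model
    using an explicit edit memory and a scope classifier."""

    def __init__(self,
                 base: Callable[[str], str],
                 counterfactual: Callable[[str, str], str],
                 scope_score: Callable[[str, str], float],
                 threshold: float = 0.5):
        self.base = base                      # frozen base model
        self.counterfactual = counterfactual  # conditions on (query, edit)
        self.scope_score = scope_score        # relevance of an edit to a query
        self.threshold = threshold
        self.memory: List[str] = []           # explicit store of edit descriptors

    def add_edit(self, edit: str) -> None:
        self.memory.append(edit)

    def __call__(self, query: str) -> str:
        if self.memory:
            score, edit = max((self.scope_score(query, e), e) for e in self.memory)
            if score >= self.threshold:
                return self.counterfactual(query, edit)  # in scope: apply the edit
        return self.base(query)  # out of scope: base predictions stay untouched
```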
- End-to-End Visual Editing with a Generatively Pre-Trained Artist [78.5922562526874]
We consider the targeted image editing problem: blending a region in a source image with a driver image that specifies the desired change.
We propose a self-supervised approach that simulates edits by augmenting off-the-shelf images in a target domain.
We show that different blending effects can be learned by an intuitive control of the augmentation process, with no other changes required to the model architecture.
arXiv Detail & Related papers (2022-05-03T17:59:30Z)
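Simulating edits by augmentation can be sketched as follows; the specific box sampling and perturbations are illustrative choices, not the paper's recipe:

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

def simulate_edit(path: str):
    """Build a (corrupted, clean) training pair by perturbing one region.
    The clean image is the reconstruction target; the perturbed crop stands
    in for the driver region a user would supply."""
    clean = Image.open(path).convert("RGB")
    w, h = clean.size
    # Pick a random box covering roughly a quarter of the image.
    x1, y1 = random.randint(0, w // 2), random.randint(0, h // 2)
    box = (x1, y1, x1 + w // 2, y1 + h // 2)
    region = clean.crop(box)
    # A color shift plus blur simulates a mismatched inserted region.
    region = ImageEnhance.Color(region).enhance(random.uniform(0.3, 1.7))
    region = region.filter(ImageFilter.GaussianBlur(radius=2))
    corrupted = clean.copy()
    corrupted.paste(region, box)
    return corrupted, clean, box  # the model learns to blend corrupted -> clean
```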
- Enjoy Your Editing: Controllable GANs for Image Editing via Latent Space Navigation [136.53288628437355]
Controllable semantic image editing enables a user to change entire image attributes with a few clicks.
Current approaches often suffer from attribute edits that are entangled, global image identity changes, and diminished photo-realism.
We propose quantitative evaluation strategies for measuring controllable editing performance, unlike prior work which primarily focuses on qualitative evaluation.
arXiv Detail & Related papers (2021-02-01T21:38:36Z)
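The core of latent-space navigation is a one-line move along an attribute direction; learning disentangled, identity-preserving directions is the hard part the paper addresses. A hypothetical sketch:

```python
import torch

def navigate_latent(z: torch.Tensor, direction: torch.Tensor,
                    strength: float) -> torch.Tensor:
    """Move a latent code along a learned attribute direction; `direction`
    is assumed given and is normalized so `strength` acts as a user slider."""
    return z + strength * direction / direction.norm()

# Usage with any pretrained generator G (hypothetical names):
#   edited_image = G(navigate_latent(z, smile_direction, strength=0.8))
```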
- TailorGAN: Making User-Defined Fashion Designs [28.805686791183618]
We propose a novel self-supervised model to synthesize garment images with disentangled attributes, without paired data.
Our method consists of a reconstruction learning step and an adversarial learning step.
Experiments on the collected dataset and real-world samples demonstrate that our method synthesizes much better results than state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T16:54:46Z)
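The alternating two-step training could be sketched as below; the module interfaces, the attribute swap, and the loss choices are assumptions for illustration, not TailorGAN's exact objectives:

```python
import torch
import torch.nn.functional as F

def training_step(G, D, g_opt, d_opt, image, attribute, step):
    """Alternate a reconstruction phase and an adversarial phase."""
    if step % 2 == 0:
        # Reconstruction step: resynthesize the garment from its own attribute.
        g_opt.zero_grad()
        recon = G(image, attribute)
        F.l1_loss(recon, image).backward()
        g_opt.step()
    else:
        # Adversarial step: images synthesized with swapped attributes
        # should be indistinguishable from real ones.
        d_opt.zero_grad()
        fake = G(image, attribute.roll(1, dims=0)).detach()
        real_logits, fake_logits = D(image), D(fake)
        d_loss = (
            F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
            + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
        )
        d_loss.backward()
        d_opt.step()
```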