OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
- URL: http://arxiv.org/abs/2411.07199v1
- Date: Mon, 11 Nov 2024 18:21:43 GMT
- Title: OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
- Authors: Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen
- Abstract summary: We present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly.
OmniEdit is trained using supervision from seven different specialist models to ensure task coverage.
We provide images with different aspect ratios to ensure that our model can handle any image in the wild.
- Score: 32.33777277141083
- Abstract: Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. First, existing models have limited editing skills due to the biased synthesis process. Second, these methods are trained on datasets containing a high volume of noise and artifacts, owing to the use of simple filtering methods such as CLIP-score. Third, all of these datasets are restricted to a single low resolution and fixed aspect ratio, limiting their versatility for real-world use cases. In this paper, we present OmniEdit, an omnipotent editor that handles seven different image editing tasks with any aspect ratio seamlessly. Our contributions are four-fold: (1) OmniEdit is trained using supervision from seven different specialist models to ensure task coverage; (2) we use importance sampling based on scores provided by large multimodal models (such as GPT-4o) instead of CLIP-score to improve data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions covering different tasks. Both automatic and human evaluations demonstrate that OmniEdit significantly outperforms all existing models. Our code, dataset and model will be available at https://tiger-ai-lab.github.io/OmniEdit/
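The abstract does not spell out how the LMM-score-based importance sampling is implemented. As a rough illustration of the general idea, the sketch below weights candidate editing pairs by a quality score (e.g., a 0-10 rating from a model like GPT-4o) rather than hard-filtering them with a CLIP-score threshold. The `score_fn` scorer, the softmax weighting, and the temperature parameter are assumptions for illustration, not part of the paper's released pipeline.

```python
import math
import random
from typing import Callable, Dict, List


def importance_weights(scores: List[float], temperature: float = 1.0) -> List[float]:
    # Softmax over per-pair quality scores: higher-rated editing pairs
    # receive a larger sampling probability instead of a binary keep/drop.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_training_batch(
    pairs: List[Dict],                   # each dict: {"source", "edited", "instruction"}
    score_fn: Callable[[Dict], float],   # hypothetical LMM-based quality scorer
    batch_size: int = 32,
    temperature: float = 1.0,
) -> List[Dict]:
    """Draw editing pairs with probability proportional to their quality score,
    rather than filtering the dataset with a fixed CLIP-score threshold."""
    scores = [score_fn(pair) for pair in pairs]
    weights = importance_weights(scores, temperature)
    return random.choices(pairs, weights=weights, k=batch_size)
```

A softmax with a temperature is just one simple way to turn ordinal quality ratings into a sampling distribution; the actual weighting scheme used by OmniEdit may differ.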
Related papers
- AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea [88.79769371584491]
We present AnyEdit, a comprehensive multi-modal instruction editing dataset.
We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results.
Experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models.
arXiv Detail & Related papers (2024-11-24T07:02:56Z)
- Multi-Reward as Condition for Instruction-based Image Editing [32.77114231615961]
We propose to address the training data quality issue with multi-perspective reward data instead of refining the ground-truth image quality.
Experiments indicate that our multi-reward conditioned model outperforms its no-reward counterpart on two popular editing pipelines.
arXiv Detail & Related papers (2024-11-06T05:02:29Z)
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet users' specific requirements.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D image text-to-image (T2I) diffusion model.
Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z)
- Real-time 3D-aware Portrait Editing from a Single Image [111.27169315556444]
3DPE can edit a face image following given prompts, like reference images or text descriptions.
A lightweight module is distilled from a 3D portrait generator and a text-to-image model.
arXiv Detail & Related papers (2024-02-21T18:36:26Z)
- Emu Edit: Precise Image Editing via Recognition and Generation Tasks [62.95717180730946]
We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editing.
We train it to multi-task across an unprecedented range of tasks, such as region-based editing, free-form editing, and Computer Vision tasks.
We show that Emu Edit can generalize to new tasks, such as image inpainting, super-resolution, and compositions of editing tasks, with just a few labeled examples.
arXiv Detail & Related papers (2023-11-16T18:55:58Z)
- FEC: Three Finetuning-free Methods to Enhance Consistency for Real Image Editing [0.0]
We propose FEC, which consists of three sampling methods, each designed for different editing types and settings.
FEC achieves two important goals in the image editing task: 1) ensuring successful reconstruction, i.e., sampling that yields a generated result preserving the texture and features of the original real image.
None of our sampling methods require fine-tuning of the diffusion model or time-consuming training on large-scale datasets.
arXiv Detail & Related papers (2023-09-26T13:43:06Z)
- Editing 3D Scenes via Text Prompts without Retraining [80.57814031701744]
DN2N is a text-driven editing method that allows for the direct acquisition of a NeRF model with universal editing capabilities.
Our method employs off-the-shelf text-based editing models of 2D images to modify the 3D scene images.
Our method achieves multiple editing types, including but not limited to appearance editing, weather transition, material changing, and style transfer.
arXiv Detail & Related papers (2023-09-10T02:31:50Z)
- SINE: SINgle Image Editing with Text-to-Image Diffusion Models [10.67527134198167]
This work aims to address the problem of single-image editing.
We propose a novel model-based guidance built upon the classifier-free guidance.
We show promising editing capabilities, including changing style, content addition, and object manipulation.
arXiv Detail & Related papers (2022-12-08T18:57:13Z)
- EdiBERT, a generative model for image editing [12.605607949417033]
EdiBERT is a bi-directional transformer trained in the discrete latent space built by a vector-quantized auto-encoder.
We show that the resulting model matches state-of-the-art performances on a wide variety of tasks.
arXiv Detail & Related papers (2021-11-30T10:23:06Z)