Learning an Image Editing Model without Image Editing Pairs
- URL: http://arxiv.org/abs/2510.14978v1
- Date: Thu, 16 Oct 2025 17:59:57 GMT
- Title: Learning an Image Editing Model without Image Editing Pairs
- Authors: Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang
- Abstract summary: Recent image editing models have achieved impressive results while following natural language editing instructions. However, they rely on supervised fine-tuning with large datasets of input-target pairs. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. We present a new training paradigm that eliminates the need for paired data entirely.
- Score: 83.03646586929638
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. However, this can propagate and magnify the artifacts of the pretrained model into the final trained model. In this work, we present a new training paradigm that eliminates the need for paired data entirely. Our approach directly optimizes a few-step diffusion model by unrolling it during training and leveraging feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates if an edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate distribution matching loss (DMD), which constrains generated images to remain within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par with various image editing diffusion models trained on extensive supervised paired data, under the few-step setting. Given the same VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
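A minimal sketch of the training loop the abstract describes, assuming hypothetical `editor` (a few-step diffusion editor unrolled during training), `vlm_score` (a differentiable VLM judge of instruction-following and content preservation), and `dmd_loss` (distribution matching against a frozen pretrained teacher). All names are illustrative, not the paper's API:

```python
import torch

# Hypothetical components, not the paper's actual interface:
#   editor    - few-step diffusion editor, unrolled so gradients flow end to end
#   vlm_score - differentiable VLM feedback on instruction-following and preservation
#   dmd_loss  - distribution matching loss keeping outputs on the image manifold
def training_step(editor, vlm_score, dmd_loss, opt, image, instruction, lambda_dmd=0.5):
    opt.zero_grad()
    edited = editor(image, instruction)              # unrolled few-step generation
    reward = vlm_score(image, edited, instruction)   # VLM feedback, higher is better
    loss = -reward.mean() + lambda_dmd * dmd_loss(edited)
    loss.backward()                                  # direct gradients through the unroll
    opt.step()
    return loss.item()
```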
Related papers
- Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback [41.41713036839503]
We introduce Edit-R1, a novel post-training framework for instruction-based image editing based on policy optimization. We employ a Multimodal Large Language Model (MLLM) as a unified, training-free reward model, leveraging its output logits to provide fine-grained feedback. Our framework is model-agnostic, delivering substantial performance gains when applied to diverse base models.
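A hedged sketch of how an MLLM's output logits can serve as an implicit reward, reading off the probability of a "yes" answer to a question about the edit; `mllm` and its call signature are illustrative assumptions, not Edit-R1's actual interface:

```python
import torch
import torch.nn.functional as F

# Illustrative only: derive a scalar reward from an MLLM's next-token logits.
@torch.no_grad()
def implicit_reward(mllm, tokenizer, image, edited, instruction):
    prompt = f"Does the second image follow the instruction '{instruction}'? Answer yes or no."
    logits = mllm(images=[image, edited], text=prompt)  # assumed: last-position vocab logits
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")
    probs = F.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0]  # fine-grained reward in [0, 1]
```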
arXiv Detail & Related papers (2025-10-19T15:38:06Z)
- Regularized Personalization of Text-to-Image Diffusion Models without Distributional Drift [5.608240462042483]
Personalization using text-to-image diffusion models involves adapting a pretrained model to novel subjects with only a few image examples. Forgetting denotes unintended distributional drift, where the model's output distribution deviates from that of the original pretrained model. We propose a new training objective based on a Lipschitz-bounded formulation that explicitly constrains deviation from the pretrained distribution.
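One plausible form of such a constraint, sketched below as a hinge penalty on how far the adapted model's noise prediction drifts from the frozen pretrained model; the bound `tau` and the hinge form are assumptions, not the paper's exact objective:

```python
import torch

# Minimal sketch: penalize drift of the adapted model's noise prediction
# beyond a bound tau, measured against the frozen pretrained model.
def drift_regularizer(model, frozen_model, x_t, t, prompt_emb, tau=0.1):
    eps = model(x_t, t, prompt_emb)
    with torch.no_grad():
        eps_ref = frozen_model(x_t, t, prompt_emb)
    drift = (eps - eps_ref).flatten(1).norm(dim=1)   # per-sample deviation
    return torch.clamp(drift - tau, min=0).mean()    # only penalize drift past tau
```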
arXiv Detail & Related papers (2025-05-26T05:03:59Z)
- Decouple-Then-Merge: Finetune Diffusion Models as Multi-Task Learning [45.89372687373466]
Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. The parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation.
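The decouple-then-merge idea can be sketched as finetuning separate copies on disjoint timestep ranges, so conflicting gradients never mix, and then merging the weights; the range choice and plain averaging below are assumptions:

```python
import copy
import torch

# Sketch: train one expert per timestep range, then average their weights.
def decouple_then_merge(base_model, finetune_fn, num_groups=4, T=1000):
    experts = []
    for g in range(num_groups):
        t_range = (g * T // num_groups, (g + 1) * T // num_groups)
        expert = copy.deepcopy(base_model)
        finetune_fn(expert, t_range)   # assumed: trains only on timesteps in t_range
        experts.append(expert.state_dict())
    merged = {k: torch.stack([sd[k].float() for sd in experts]).mean(0)
              for k in experts[0]}
    base_model.load_state_dict(merged)
    return base_model
```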
arXiv Detail & Related papers (2024-10-09T08:19:25Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
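Roughly, a training-free pipeline of this kind extracts diffusion features, clusters them into class-agnostic masks, and names each mask with CLIP; the three callables below are stand-ins for the real components, not FreeSeg-Diff's actual interface:

```python
import torch

# Sketch of a training-free open-vocabulary segmentation pipeline.
@torch.no_grad()
def open_vocab_segment(image, vocabulary, extract_features, cluster_to_masks, clip_label):
    feats = extract_features(image)    # assumed: per-pixel diffusion features
    masks = cluster_to_masks(feats)    # assumed: unsupervised grouping, e.g. k-means
    return [(m, clip_label(image, m, vocabulary)) for m in masks]
```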
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder [13.453138169497903]
SeNM-VAE is a semi-supervised noise modeling method that leverages both paired and unpaired datasets to generate realistic degraded data.
We employ our method to generate paired training samples for real-world image denoising and super-resolution tasks.
Our approach produces higher-quality synthetic degraded images than other unpaired and paired noise modeling methods.
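Once such a degradation model is trained, minting paired data reduces to sampling a degraded counterpart for each clean image; `noise_vae.degrade` and `latent_dim` below are assumed interfaces, not SeNM-VAE's actual API:

```python
import torch

# Sketch: use a learned degradation model to mint (input, target) pairs.
@torch.no_grad()
def make_pairs(noise_vae, clean_images):
    z = torch.randn(clean_images.size(0), noise_vae.latent_dim,
                    device=clean_images.device)   # sample degradation latents
    degraded = noise_vae.degrade(clean_images, z)
    return degraded, clean_images                 # pairs for denoising / SR training
```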
arXiv Detail & Related papers (2024-03-26T09:03:40Z)
- Do the Frankenstein, or how to achieve better out-of-distribution performance with manifold mixing model soup [1.0878040851637998]
We show that the fused model gives significantly better out-of-distribution performance when finetuning a CLIP model for image classification.
It also provides better accuracy on the original dataset where finetuning was performed.
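A minimal "soup"-style sketch: interpolate a finetuned checkpoint with its pretrained initialization using a separate mixing coefficient per parameter group; the crude grouping rule below is an assumption, not the paper's manifold decomposition:

```python
import torch

# Sketch: per-group interpolation between pretrained and finetuned weights.
def manifold_mix(pretrained_sd, finetuned_sd, alphas):
    mixed = {}
    for name, w_pre in pretrained_sd.items():
        group = name.split(".")[0]     # crude layer-group key (assumption)
        a = alphas.get(group, 0.5)     # per-group mixing coefficient
        mixed[name] = (1 - a) * w_pre.float() + a * finetuned_sd[name].float()
    return mixed
```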
arXiv Detail & Related papers (2023-08-28T06:13:32Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
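A toy version of one curation step in this spirit: embedding-based relevance filtering against a curated seed set plus greedy near-duplicate removal. The thresholds and cosine-similarity rule are assumptions, not the paper's pipeline:

```python
import torch

# Sketch: keep images similar to curated seeds, drop near-duplicates.
@torch.no_grad()
def curate(pool_emb, seed_emb, dup_thresh=0.95, keep_thresh=0.5):
    pool = torch.nn.functional.normalize(pool_emb, dim=1)
    seed = torch.nn.functional.normalize(seed_emb, dim=1)
    sim_seed = (pool @ seed.T).max(dim=1).values   # relevance to curated seeds
    keep = sim_seed > keep_thresh
    kept_idx = []                                  # greedy dedup among kept items
    for i in torch.nonzero(keep).flatten().tolist():
        if all((pool[i] @ pool[j]) < dup_thresh for j in kept_idx):
            kept_idx.append(i)
    return kept_idx
```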
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- Masked Images Are Counterfactual Samples for Robust Fine-tuning [77.82348472169335]
Fine-tuning deep learning models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness.
We propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuned model.
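One simple way to build such counterfactual samples, under assumptions about patch size, masking ratio, and refill source: occlude random patches and refill them from another image, decoupling appearance from semantics:

```python
import torch

# Sketch: replace a fraction of patches in img (CHW tensor) with patches
# from another image of the same size. Patch size and ratio are assumptions.
def counterfactual_mask(img, other, patch=32, ratio=0.3):
    x = img.clone()
    _, H, W = x.shape
    n_h, n_w = H // patch, W // patch
    n_mask = int(ratio * n_h * n_w)
    idx = torch.randperm(n_h * n_w)[:n_mask]
    for i in idx.tolist():
        r, c = (i // n_w) * patch, (i % n_w) * patch
        x[:, r:r + patch, c:c + patch] = other[:, r:r + patch, c:c + patch]
    return x
```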
arXiv Detail & Related papers (2023-03-06T11:51:28Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [47.432215933099016]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
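As a baseline stand-in for parameter-space fusion, checkpoints sharing an architecture can be merged by weighted averaging; the actual method solves a closed-form per-layer regression, so uniform averaging below is a deliberate simplification:

```python
import torch

# Sketch: dataless merge of finetuned checkpoints by (weighted) averaging.
def merge_weights(state_dicts, coeffs=None):
    n = len(state_dicts)
    coeffs = coeffs or [1.0 / n] * n   # uniform by default
    return {k: sum(c * sd[k].float() for c, sd in zip(coeffs, state_dicts))
            for k in state_dicts[0]}
```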
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
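The closed-loop structure can be sketched as: a generator proposes candidates, an ensemble of scorers rates them, and the best candidate seeds the next round. `generator` and `scorers` are assumed callables; real systems refine in latent space rather than by resampling:

```python
import torch

# Sketch: closed-loop iterative consensus between a generator and scorers.
@torch.no_grad()
def iterative_consensus(generator, scorers, prompt, rounds=5, n_candidates=8):
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        candidates = [generator(prompt, init=best) for _ in range(n_candidates)]
        for cand in candidates:
            score = sum(s(cand, prompt) for s in scorers)  # ensemble consensus
            if score > best_score:
                best, best_score = cand, score
    return best
```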
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
- Fast Model Editing at Scale [77.69220974621425]
We propose Model Editor Networks with Gradient Decomposition (MEND).
MEND is a collection of small auxiliary editing networks that use a single desired input-output pair to make fast, local edits to a pre-trained model.
MEND can be trained on a single GPU in less than a day even for 10 billion+ parameter models.
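The core trick can be sketched for a single linear layer: the fine-tuning gradient factors as a rank-1 outer product of the output-side error and the input activation, and small learned networks transform those factors into a fast, local edit. `g_net` and `h_net` below are illustrative stand-ins for the learned editor, not MEND's exact architecture:

```python
import torch

# Sketch: rank-1 edit of a linear layer's weight from decomposed gradient factors.
#   u     - input activation for the desired input-output pair (shape: in_dim)
#   delta - output-side error signal (shape: out_dim)
def mend_edit(weight, u, delta, g_net, h_net, lr=1e-2):
    u_t, delta_t = g_net(u), h_net(delta)            # learned transforms of the factors
    return weight - lr * torch.outer(delta_t, u_t)   # fast, local rank-1 update
```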
arXiv Detail & Related papers (2021-10-21T17:41:56Z)