SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback
- URL: http://arxiv.org/abs/2504.12833v2
- Date: Tue, 12 Aug 2025 00:19:11 GMT
- Title: SPIE: Semantic and Structural Post-Training of Image Editing Diffusion Models with AI feedback
- Authors: Elior Benarous, Yilun Du, Heng Yang
- Abstract summary: SPIE is a novel approach for semantic and structural post-training of instruction-based image editing diffusion models. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes after just 10 training steps.
- Score: 28.807572302899004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents SPIE: a novel approach for semantic and structural post-training of instruction-based image editing diffusion models, addressing key challenges in alignment with user prompts and consistency with input images. We introduce an online reinforcement learning framework that aligns the diffusion model with human preferences without relying on extensive human annotations or curating a large dataset. Our method significantly improves alignment with instructions and realism in two ways. First, SPIE captures fine nuances in the desired edit by leveraging a visual prompt, enabling detailed control over visual edits without lengthy textual prompts. Second, it achieves precise and structurally coherent modifications in complex scenes while maintaining high fidelity in instruction-irrelevant areas. This approach simplifies users' efforts to achieve highly specific edits, requiring only 5 reference images depicting a certain concept for training. Experimental results demonstrate that SPIE can perform intricate edits in complex scenes after just 10 training steps. Finally, we showcase the versatility of our method by applying it to robotics, where targeted image edits enhance the visual realism of simulated environments, improving their utility as proxies for real-world settings.
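To make the training recipe concrete, here is a minimal REINFORCE-style sketch of online RL post-training with an automated reward, written in plain PyTorch. Everything in it is a stand-in: `ToyEditor` replaces the instruction-based diffusion editor, and the hand-rolled `reward` (closeness to the ~5 reference images plus a fidelity term for instruction-irrelevant content) replaces SPIE's AI-feedback scorer; none of this is the paper's actual implementation.

```python
import torch
import torch.nn as nn

# All components below are hypothetical stand-ins: SPIE's editor is an
# instruction-based diffusion model and its reward comes from AI feedback
# (automated scoring of semantic alignment and structural consistency).

class ToyEditor(nn.Module):
    """Predicts a stochastic residual edit over the input image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 6, 3, padding=1),  # 3 mean channels + 3 log-std channels
        )

    def forward(self, x):
        mu, log_std = self.net(x).chunk(2, dim=1)
        return mu, log_std.clamp(-5, 2)

def reward(edited, refs, original):
    # Semantic term: pull edits toward the concept shown in the reference images.
    semantic = -((edited - refs.mean(0, keepdim=True)) ** 2).mean(dim=(1, 2, 3))
    # Fidelity term: penalize drift in instruction-irrelevant content.
    fidelity = -((edited - original) ** 2).mean(dim=(1, 2, 3))
    return semantic + 0.5 * fidelity

editor = ToyEditor()
opt = torch.optim.Adam(editor.parameters(), lr=1e-4)
refs = torch.rand(5, 3, 64, 64)        # the ~5 reference images for the concept

for step in range(10):                 # the abstract reports edits after ~10 steps
    x = torch.rand(8, 3, 64, 64)       # batch of scenes to edit
    mu, log_std = editor(x)
    dist = torch.distributions.Normal(mu, log_std.exp())
    edit = dist.sample()               # the stochastic "action"
    edited = (x + edit).clamp(0, 1)
    r = reward(edited, refs, x)
    # REINFORCE: raise the log-probability of edits in proportion to reward.
    logp = dist.log_prob(edit).mean(dim=(1, 2, 3))
    loss = -(logp * (r - r.mean()).detach()).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The two reward terms mirror the abstract's two goals: the semantic term pulls edits toward the concept depicted in the reference images, while the fidelity term penalizes drift in regions the instruction should not touch.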
Related papers
- Image Editing As Programs with Diffusion Models [69.05164729625052]
We introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
arXiv Detail & Related papers (2025-06-04T16:57:24Z)
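As a toy illustration of IEAP's "editing as programs" framing, the sketch below decomposes a compound instruction into a sequence of atomic operations and executes them in order. The operation set and the example program are invented; IEAP's actual atomic operations are realized by a DiT-based diffusion model, not by pixel arithmetic.

```python
import torch

# Toy illustration of the "editing as programs" framing: a compound instruction
# is decomposed into a sequence of atomic operations executed in order. The
# operation set and the example program are invented stand-ins.

def op_brighten(img, amount=0.1):
    return (img + amount).clamp(0, 1)

def op_desaturate(img, amount=0.5):
    gray = img.mean(dim=0, keepdim=True).expand_as(img)
    return (1 - amount) * img + amount * gray

OPS = {"brighten": op_brighten, "desaturate": op_desaturate}

def run_program(img, program):
    # program: list of (op_name, kwargs) pairs produced by an instruction parser.
    for name, kwargs in program:
        img = OPS[name](img, **kwargs)
    return img

img = torch.rand(3, 64, 64)
# "Make the photo brighter and give it a faded look" -> two atomic steps.
edited = run_program(img, [("brighten", {"amount": 0.2}),
                           ("desaturate", {"amount": 0.6})])
```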
- Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions [20.617718631292696]
We develop a novel paradigm for instruction-driven image editing that leverages the vast supply of widely available text-image pairs. Our approach introduces a multi-scale learnable region to localize and guide the editing process. By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing.
arXiv Detail & Related papers (2025-05-25T22:40:59Z)
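A hypothetical sketch of the multi-scale learnable region idea: soft masks are learned at several resolutions, merged into one region, and that region gates where a candidate edit applies, with an image-text alignment score as the only supervision. The mask parameterization and the `alignment_score` stub (standing in for a CLIP-style model) are assumptions, not the paper's components.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch: soft masks learned at several resolutions are upsampled
# and averaged into one editing region that gates where a candidate edit applies.

class MultiScaleRegion(nn.Module):
    def __init__(self, size=64, scales=(8, 16, 32)):
        super().__init__()
        self.size = size
        self.logits = nn.ParameterList(
            [nn.Parameter(torch.zeros(1, 1, s, s)) for s in scales]
        )

    def forward(self):
        masks = [F.interpolate(torch.sigmoid(p), size=self.size,
                               mode="bilinear", align_corners=False)
                 for p in self.logits]
        return torch.stack(masks).mean(0)  # (1, 1, H, W) soft region

def alignment_score(img):
    # Stand-in for an image-text alignment model such as CLIP similarity.
    return img.mean()

region = MultiScaleRegion()
opt = torch.optim.Adam(region.parameters(), lr=1e-2)
original = torch.rand(1, 3, 64, 64)
raw_edit = torch.rand(1, 3, 64, 64)  # candidate edit from a frozen generator

for _ in range(20):
    m = region()
    blended = m * raw_edit + (1 - m) * original  # edit only inside the region
    # Maximize alignment with the target caption; keep the region compact.
    loss = -alignment_score(blended) + 0.1 * m.mean()
    opt.zero_grad(); loss.backward(); opt.step()
```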
- MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection [13.467269066605452]
We propose MIND-Edit, an end-to-end image-editing framework integrating a pretrained diffusion model with an MLLM. MIND-Edit introduces two complementary strategies: (1) a text instruction optimization strategy that clarifies ambiguous user instructions based on semantic reasoning from the MLLM, and (2) an MLLM insight-driven editing strategy that explicitly leverages the intrinsic visual understanding capability of the MLLM to infer editing intent. Extensive experiments demonstrate that MIND-Edit outperforms state-of-the-art image editing methods in both quantitative metrics and visual quality, particularly under complex and challenging scenarios.
arXiv Detail & Related papers (2025-05-25T13:54:31Z)
- RePrompt: Reasoning-Augmented Reprompting for Text-to-Image Generation via Reinforcement Learning [88.14234949860105]
RePrompt is a novel reprompting framework that introduces explicit reasoning into the prompt enhancement process via reinforcement learning. Our approach enables end-to-end training without human-annotated data.
arXiv Detail & Related papers (2025-05-23T06:44:26Z)
- PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models [80.98455219375862]
We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Our approach is preferred by users 77-90% of the time in user studies.
arXiv Detail & Related papers (2025-02-06T13:08:43Z)
- UIP2P: Unsupervised Instruction-based Image Editing via Cycle Edit Consistency [69.33072075580483]
We propose an unsupervised model for instruction-based image editing that eliminates the need for ground-truth edited images during training. Our method rests on a novel editing mechanism, Cycle Edit Consistency (CEC), which applies forward and backward edits in one training step and enforces consistency in image and attention spaces.
arXiv Detail & Related papers (2024-12-19T18:59:58Z)
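A minimal sketch of the cycle idea, assuming a toy instruction-conditioned editor and treating the negated instruction embedding as the "reverse" instruction: a forward edit followed by the backward edit should reconstruct the input, which yields a training signal without any ground-truth edited images. UIP2P additionally enforces consistency in attention space, which is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of Cycle Edit Consistency: a forward edit followed by its reverse
# instruction should reconstruct the input. The editor and the
# "reverse = negated embedding" convention are stand-ins, not UIP2P's design.

class ToyInstructedEditor(nn.Module):
    def __init__(self, instr_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + instr_dim, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1),
        )

    def forward(self, img, instr):
        b, _, h, w = img.shape
        # Broadcast the instruction embedding over spatial dims and concatenate.
        cond = instr[:, :, None, None].expand(b, -1, h, w)
        return img + self.net(torch.cat([img, cond], dim=1))

editor = ToyInstructedEditor()
opt = torch.optim.Adam(editor.parameters(), lr=1e-4)

img = torch.rand(4, 3, 64, 64)
fwd_instr = torch.randn(4, 8)   # e.g. "make it snowy"
bwd_instr = -fwd_instr          # toy reverse instruction ("remove the snow")

edited = editor(img, fwd_instr)            # forward edit
recovered = editor(edited, bwd_instr)      # backward edit in the same step
loss = F.mse_loss(recovered, img)          # cycle consistency in image space
opt.zero_grad(); loss.backward(); opt.step()
```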
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit a given synthetic or real image to meet users' specific requirements.
Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)
- Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE).
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z)
- ReGeneration Learning of Diffusion Models with Rich Prompts for Zero-Shot Image Translation [8.803251014279502]
Large-scale text-to-image models have demonstrated an amazing ability to synthesize diverse and high-fidelity images.
However, current models can impose significant changes on the original image content during the editing process.
We propose ReGeneration learning in an image-to-image Diffusion model (ReDiffuser).
arXiv Detail & Related papers (2023-05-08T12:08:12Z)
- StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [115.49488548588305]
A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images. Existing methods either finetune the model or invert the image in the latent space of the pretrained model. They suffer from two problems: unsatisfactory results for selected regions and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z)
- Zero-shot Image-to-Image Translation [57.46189236379433]
We propose pix2pix-zero, an image-to-image translation method that can preserve the original image without manual prompting.
We propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process.
Our method does not need additional training for these edits and can directly use the existing text-to-image diffusion model.
arXiv Detail & Related papers (2023-02-06T18:59:51Z)
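The cross-attention guidance step can be sketched as follows, with a toy single-head attention in place of the diffusion model's cross-attention layers: record the maps from the reconstruction pass, then during the edited pass take a gradient step on the latent so its maps stay close to the recorded ones. The shapes and the 0.1 step size are illustrative assumptions, not pix2pix-zero's internals.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of cross-attention guidance: record cross-attention maps while
# reconstructing the input, then nudge the latent during the edited pass so its
# maps stay close to the recorded ones. Single-head attention is a toy stand-in.

def cross_attention(latent, text_emb, dim=32):
    # latent: (B, N, dim) image tokens; text_emb: (B, T, dim) prompt tokens.
    return torch.softmax(latent @ text_emb.transpose(1, 2) / dim ** 0.5, dim=-1)

B, N, T, D = 1, 16, 4, 32
ref_latent = torch.randn(B, N, D)
src_text = torch.randn(B, T, D)    # embedding of the source prompt
edit_text = torch.randn(B, T, D)   # embedding of the edited prompt

# Reference maps recorded during the reconstruction pass.
ref_maps = cross_attention(ref_latent, src_text).detach()

# One guidance step at some denoising timestep of the edited pass.
latent = ref_latent.clone().requires_grad_(True)
maps = cross_attention(latent, edit_text)
loss = F.mse_loss(maps, ref_maps)
grad, = torch.autograd.grad(loss, latent)
latent = (latent - 0.1 * grad).detach()  # guided latent for the next step
```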
- Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models [0.0]
We propose an optimization-free and zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt.
We prove our method's efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits.
arXiv Detail & Related papers (2022-11-15T01:07:38Z)