Related papers: Instruction-based Image Editing with Planning, Reasoning, and Generation

Instruction-based Image Editing with Planning, Reasoning, and Generation

URL: http://arxiv.org/abs/2602.22624v1
Date: Thu, 26 Feb 2026 04:56:02 GMT
Title: Instruction-based Image Editing with Planning, Reasoning, and Generation
Authors: Liya Ji, Chenyang Qi, Qifeng Chen,
Abstract summary: Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task.<n>We aim to bridge understanding and generation via a new multi-modality model that provides the intelligent abilities to instruction-based image editing models.<n>Our method has competitive editing abilities on complex real-world images.
Score: 52.0364486403062
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides the intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we individually separate the instruction editing task with the multi-modality chain of thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model could reason the appropriate sub-prompts considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, a hint-guided instruction-based editing network is proposed for editing image generations based on the sizeable text-to-image diffusion model to accept the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.

Related papers

TalkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing [21.708181904910177]
Multimodal Large Language Models (MLLMs) promote information exchange between instructions and images.<n>These frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks.<n>We present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction.
arXiv Detail & Related papers (2026-01-05T09:00:32Z)
DreamOmni2: Multimodal Instruction-based Editing and Generation [77.997848231822]
We propose two novel tasks: multimodal instruction-based editing and generation.<n>These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts.<n>Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing.
arXiv Detail & Related papers (2025-10-08T06:07:14Z)
EditVerse: Unifying Image and Video Editing and Generation with In-Context Learning [58.53074381801114]
We introduce EditVerse, a unified framework for image and video generation and editing within a single model.<n>By representing all modalities, i.e. text, image, and video, as a unified token sequence, EditVerse leverages self-attention to achieve robust in-context learning.<n>We present EditVerseBench, the first benchmark for instruction-based video editing covering diverse tasks and resolutions.
arXiv Detail & Related papers (2025-09-24T17:59:30Z)
Beyond Editing Pairs: Fine-Grained Instructional Image Editing via Multi-Scale Learnable Regions [20.617718631292696]
We develop a novel paradigm for instruction-driven image editing that leverages widely available and enormous text-image pairs.<n>Our approach introduces a multi-scale learnable region to localize and guide the editing process.<n>By treating the alignment between images and their textual descriptions as supervision and learning to generate task-specific editing regions, our method achieves high-fidelity, precise, and instruction-consistent image editing.
arXiv Detail & Related papers (2025-05-25T22:40:59Z)
Image Inpainting Models are Effective Tools for Instruction-guided Image Editing [42.63350374074953]
This technique report is for the winning solution of the CVPR2024 GenAI Media Generation Challenge Workshop's Instruction-guided Image Editing track. We use a 4-step process IIIE (Inpainting-based Instruction-guided Image Editing): editing category classification, main editing object identification, editing mask acquisition, and image inpainting. Results show that through proper combinations of language models and image inpainting models, our pipeline can reach a high success rate with satisfying visual quality.
arXiv Detail & Related papers (2024-07-18T03:55:33Z)
A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing. It exploits Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities. We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z)
Learning by Planning: Language-Guided Global Image Editing [53.72807421111136]
We develop a text-to-operation model to map the vague editing language request into a series of editing operations. The only supervision in the task is the target image, which is insufficient for a stable training of sequential decisions. We propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth.
arXiv Detail & Related papers (2021-06-24T16:30:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.