Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder
- URL: http://arxiv.org/abs/2509.12883v1
- Date: Tue, 16 Sep 2025 09:36:17 GMT
- Title: Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder
- Authors: Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, Guoquan Zhang,
- Abstract summary: Lego-Edit organizes a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM.
- Score: 5.7175031100654055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: https://github.com/xiaomi-research/lego-edit.
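The abstract describes an MLLM "builder" that composes model-level "bricks" (editing tools) to carry out an instruction. As a purely conceptual illustration of that tool-composition pattern (this is not the authors' code, and all names such as `ToolRegistry` and `plan_edit` are hypothetical), such a system might look like:

```python
# Conceptual sketch of an MLLM builder composing model-level editing tools.
# The planner here is a trivial keyword stub standing in for the MLLM.
from typing import Callable, Dict, List


class ToolRegistry:
    """Holds model-level editing tools the planner can compose."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self._tools[name] = fn

    def run(self, name: str, image: str) -> str:
        return self._tools[name](image)


def plan_edit(instruction: str) -> List[str]:
    """Stand-in for the MLLM: maps an instruction to a tool sequence."""
    plan: List[str] = []
    if "remove" in instruction:
        plan += ["segment", "inpaint"]
    if "sky" in instruction:
        plan += ["recolor"]
    return plan


registry = ToolRegistry()
registry.register("segment", lambda img: img + "->segmented")
registry.register("inpaint", lambda img: img + "->inpainted")
registry.register("recolor", lambda img: img + "->recolored")


def edit(image: str, instruction: str) -> str:
    """Run each planned tool in order, threading the image through."""
    for step in plan_edit(instruction):
        image = registry.run(step, image)
    return image


print(edit("photo.png", "remove the person and recolor the sky"))
# -> photo.png->segmented->inpainted->recolored
```

Because tools live in a registry rather than being baked into the planner, a new tool can be registered and used without retraining the planner, which mirrors the paper's claim that new editing tools can be used without additional fine-tuning.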
Related papers
- TalkPhoto: A Versatile Training-Free Conversational Assistant for Intelligent Image Editing [21.708181904910177]
Multimodal Large Language Models (MLLMs) promote information exchange between instructions and images. These frameworks often build a multi-instruction dataset to train the model to handle multiple editing tasks. We present TalkPhoto, a versatile training-free image editing framework that facilitates precise image manipulation through conversational interaction.
arXiv Detail & Related papers (2026-01-05T09:00:32Z) - ReasonEdit: Towards Reasoning-Enhanced Image Editing Models [60.902953259781675]
A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder. We show that unlocking the reasoning capabilities of MLLMs can push the boundaries of editing models. Our proposed framework enables image editing in a thinking-editing-reflection loop.
arXiv Detail & Related papers (2025-11-27T17:02:48Z) - CompBench: Benchmarking Complex Instruction-guided Image Editing [63.347846732450364]
CompBench is a large-scale benchmark for complex instruction-guided image editing. We propose an MLLM-human collaborative framework with tailored task pipelines. We propose an instruction decoupling strategy that disentangles editing intents into four key dimensions.
arXiv Detail & Related papers (2025-05-18T02:30:52Z) - FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model [54.693572837423226]
FireEdit is an innovative Fine-grained Instruction-based image editing framework that exploits a REgion-aware VLM. FireEdit is designed to accurately comprehend user instructions and ensure effective control over the editing process. Our approach surpasses the state-of-the-art instruction-based image editing methods.
arXiv Detail & Related papers (2025-03-25T16:59:42Z) - BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z) - InstructEdit: Instruction-based Knowledge Editing for Large Language Models [39.2147118489123]
We develop an instruction-based editing technique, termed InstructEdit, which enables the editor to adapt to various tasks simultaneously using simple instructions.
Experiments involving held-out unseen tasks show that InstructEdit consistently surpasses previous strong baselines.
arXiv Detail & Related papers (2024-02-25T15:46:33Z) - SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance their understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z) - Guiding Instruction-based Image Editing via Multimodal Large Language Models [102.82211398699644]
Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation.
We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE).
MGIE learns to derive expressive instructions and provides explicit guidance.
arXiv Detail & Related papers (2023-09-29T10:01:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.