Related papers: GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

URL: http://arxiv.org/abs/2407.05600v1
Date: Mon, 8 Jul 2024 04:30:53 GMT
Title: GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
Authors: Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu
Abstract summary: GenArtist is a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance.
Score: 60.09562648953926
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks and possess the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system, coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as can be seen in Fig. 1. Project page is https://zhenyuw16.github.io/GenArtist_page.
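
The planning procedure described in the abstract (decompose the request, pick a tool per sub-problem, verify each step, and fall back to a correction step on failure) can be illustrated with a toy sketch. This is not GenArtist's implementation: the `Node` class, the `generate` and `edit` tools, the `verify` check, and the representation of an "image" as a set of satisfied descriptions are all hypothetical stand-ins chosen so the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple

# Toy stand-in (assumption): an "image" is the set of descriptions it satisfies.
Image = FrozenSet[str]
Tool = Callable[[Image, str], Image]
Trace = List[Tuple[str, str, bool]]


@dataclass
class Node:
    """One sub-problem in a hypothetical planning tree."""
    goal: str                                  # what this step should achieve
    tool: str                                  # tool chosen for this step
    corrections: List["Node"] = field(default_factory=list)  # tried if verification fails


def run(node: Node, tools: Dict[str, Tool], verify, image: Image, trace: Trace):
    """Apply the node's tool, verify the result, and descend into corrections on failure."""
    candidate = tools[node.tool](image, node.goal)
    ok = verify(candidate, node.goal)
    trace.append((node.tool, node.goal, ok))
    if ok:
        return candidate, True
    for child in node.corrections:
        fixed, child_ok = run(child, tools, verify, candidate, trace)
        if child_ok:
            return fixed, True
    return candidate, False


def solve(plan: List[Node], tools: Dict[str, Tool], verify, image: Image = frozenset()):
    """Execute the sub-problems in order, recording every tool call and its verdict."""
    trace: Trace = []
    for node in plan:
        image, _ = run(node, tools, verify, image, trace)
    return image, trace


def generate(image: Image, goal: str) -> Image:
    # Hypothetical generator that cannot handle spatial relations.
    return image if "left of" in goal else image | {goal}


def edit(image: Image, goal: str) -> Image:
    # Hypothetical editor that always satisfies the goal.
    return image | {goal}


def verify(image: Image, goal: str) -> bool:
    # Hypothetical verifier: the goal must be present in the toy image.
    return goal in image


TOOLS = {"generate": generate, "edit": edit}
plan = [
    Node("a red cube", "generate"),
    Node("cube left of sphere", "generate",
         corrections=[Node("cube left of sphere", "edit")]),
]
final_image, trace = solve(plan, TOOLS, verify)
```

In this toy run the second sub-problem fails verification after `generate` and is repaired by the `edit` correction node, which is the kind of step-by-step verification and self-correction the abstract refers to. In a real system the verifier and tool selection would be handled by the MLLM agent rather than by fixed functions.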

Related papers

ImgEdit: A Unified Image Editing Dataset and Benchmark [14.185771939071149]
We introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs. ImgEdit surpasses existing datasets in both task novelty and data quality. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance.
arXiv Detail & Related papers (2025-05-26T17:53:33Z)
Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment [55.74860093731475]
Marmot is a novel framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting. We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism. Experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.
arXiv Detail & Related papers (2025-04-10T16:54:28Z)
EditAR: Unified Conditional Generation with Autoregressive Models [58.093860528672735]
We propose EditAR, a single unified autoregressive framework for a variety of conditional image generation tasks. The model takes both images and instructions as inputs, and predicts the edited image tokens in a vanilla next-token paradigm. We evaluate its effectiveness across diverse tasks on established benchmarks, showing performance competitive with various state-of-the-art task-specific methods.
arXiv Detail & Related papers (2025-01-08T18:59:35Z)
UNIC-Adapter: Unified Image-instruction Adapter with Multi-modal Transformer for Image Generation [64.8341372591993]
We propose a new approach to unify controllable generation within a single framework. Specifically, we propose the unified image-instruction adapter (UNIC-Adapter) built on the Multi-Modal-Diffusion Transformer architecture. Our UNIC-Adapter effectively extracts multi-modal instruction information by incorporating both conditional images and task instructions.
arXiv Detail & Related papers (2024-12-25T15:19:02Z)
BrushEdit: All-In-One Image Inpainting and Editing [79.55816192146762]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
GraPE: A Generate-Plan-Edit Framework for Compositional T2I Synthesis [10.47359822447001]
We present an alternate paradigm for T2I synthesis, decomposing the task of complex multi-step generation into three steps. Our approach derives its strength from the fact that it is modular in nature, is training free, and can be applied over any combination of image generation and editing models.
arXiv Detail & Related papers (2024-12-08T22:29:56Z)
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration [20.988801611785522]
We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation. The collaborative workflow includes three stages: Design, Generation, and Redesign. To tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents each specialized for one scenario.
arXiv Detail & Related papers (2024-12-05T18:56:05Z)
SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing [50.098005973600024]
We propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models. Experimental results demonstrate that the SPAgent effectively coordinates models to generate or edit videos.
arXiv Detail & Related papers (2024-11-28T08:07:32Z)
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision [32.33777277141083]
We present OmniEdit, an omnipotent editor to handle seven different image editing tasks with any aspect ratio seamlessly. OmniEdit is trained by utilizing the supervision from seven different specialist models to ensure task coverage. We provide images with different aspect ratios to ensure that our model can handle any image in the wild.
arXiv Detail & Related papers (2024-11-11T18:21:43Z)
VisionCoder: Empowering Multi-Agent Auto-Programming for Image Processing with Hybrid LLMs [8.380216582290025]
This paper presents a multi-agent framework that collaboratively completes auto-programming tasks. Each agent plays a distinct role in the software development cycle, collectively forming a virtual organisation. By establishing a tree-structured thought distribution and development mechanism across project, module, and function levels, this framework offers a cost-effective and efficient solution.
arXiv Detail & Related papers (2024-10-25T01:52:15Z)
Group Diffusion Transformers are Unsupervised Multitask Learners [49.288489286276146]
Group Diffusion Transformers (GDTs) are a novel framework that unifies diverse visual generation tasks. GDTs build upon diffusion transformers with minimal architectural modifications by concatenating self-attention tokens across images. We evaluate GDTs on a benchmark featuring over 200 instructions across 30 distinct visual generation tasks.
arXiv Detail & Related papers (2024-10-19T07:53:15Z)
Image Inpainting Models are Effective Tools for Instruction-guided Image Editing [42.63350374074953]
This technical report is for the winning solution of the CVPR2024 GenAI Media Generation Challenge Workshop's Instruction-guided Image Editing track. We use a 4-step process, IIIE (Inpainting-based Instruction-guided Image Editing): editing category classification, main editing object identification, editing mask acquisition, and image inpainting. Results show that through proper combinations of language models and image inpainting models, our pipeline can reach a high success rate with satisfying visual quality.
arXiv Detail & Related papers (2024-07-18T03:55:33Z)
A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
PromptFix: You Prompt and We Fix the Photo [84.69812824355269]
Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks. The lack of diverse instruction-following data hampers the development of models. We propose PromptFix, a framework that enables diffusion models to follow human instructions.
arXiv Detail & Related papers (2024-05-27T03:13:28Z)
Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core. Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z)
MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA [34.21194537887934]
We propose a plug-in Model Editing method based on neuron-indexed dynamic LoRA (MELO). Our proposed MELO achieves state-of-the-art editing performance on three sequential editing tasks.
arXiv Detail & Related papers (2023-12-19T02:11:01Z)
Self-correcting LLM-controlled Diffusion Models [83.26605445217334]
We introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image. Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
arXiv Detail & Related papers (2023-11-27T18:56:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.