GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
- URL: http://arxiv.org/abs/2407.05600v2
- Date: Mon, 28 Oct 2024 14:08:13 GMT
- Title: GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing
- Authors: Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu
- Abstract summary: GenArtist is a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent.
We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution.
Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance.
- Score: 60.09562648953926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of existing image generation and editing methods, current models still struggle with complex problems such as intricate text prompts, and the absence of verification and self-correction mechanisms makes the generated images unreliable. Meanwhile, a single model tends to specialize in particular tasks with the corresponding capabilities, making it inadequate for fulfilling all user requirements. We propose GenArtist, a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent. We integrate a comprehensive range of existing models into a tool library and utilize the agent for tool selection and execution. For a complex problem, the MLLM agent decomposes it into simpler sub-problems and constructs a tree structure to systematically plan the procedure of generation, editing, and self-correction with step-by-step verification. By automatically generating missing position-related inputs and incorporating position information, the appropriate tool can be effectively employed to address each sub-problem. Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance and surpassing existing models such as SDXL and DALL-E 3, as shown in Fig. 1. Project page: https://zhenyuw16.github.io/GenArtist_page.
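The abstract describes a tree-structured plan-execute-verify loop: the MLLM agent decomposes the prompt into sub-problems, selects a tool for each, fills in any missing position-related inputs, and verifies every step before moving on, replanning when a check fails. Below is a minimal Python sketch of that control flow under assumed interfaces; the agent methods (plan_tree, prepare_inputs, verify, replan) and the tool-library dictionary are hypothetical names for illustration, not the authors' actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Step:
    """One node of the planning tree: a sub-problem plus the tool chosen for it."""
    instruction: str
    tool: str
    children: List["Step"] = field(default_factory=list)

def solve(agent, tools: Dict[str, Callable], prompt: str, image=None):
    """Decompose a complex prompt into a tree of sub-problems, execute each with
    the selected tool, and verify / self-correct step by step."""
    root = agent.plan_tree(prompt)                # MLLM decomposes the request
    queue = [root]
    while queue:
        step = queue.pop(0)
        # The agent may auto-generate missing position-related inputs (e.g. boxes).
        args = agent.prepare_inputs(step, image)
        image = tools[step.tool](image, **args)   # call a generation or editing tool
        ok, feedback = agent.verify(step, image)  # step-by-step verification
        if ok:
            queue.extend(step.children)           # proceed down the planning tree
        else:
            queue.insert(0, agent.replan(step, feedback))  # self-correction branch
    return image
```

The loop is the essential point: each sub-problem is only accepted after verification, and failures are pushed back to the agent as new sub-problems rather than propagated to the final image.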
Related papers
- OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision [32.33777277141083]
We present OmniEdit, an omnipotent editor that handles seven different image editing tasks on any aspect ratio seamlessly.
OmniEdit is trained with supervision from seven different specialist models to ensure task coverage.
We provide images with different aspect ratios to ensure that our model can handle any image in the wild.
arXiv Detail & Related papers (2024-11-11T18:21:43Z) - VisionCoder: Empowering Multi-Agent Auto-Programming for Image Processing with Hybrid LLMs [8.380216582290025]
This paper presents a multi-agent framework that collaboratively completes auto-programming tasks.
Each agent plays a distinct role in the software development cycle, collectively forming a virtual organisation.
By establishing a tree-structured thought distribution and development mechanism across project, module, and function levels, this framework offers a cost-effective and efficient solution.
arXiv Detail & Related papers (2024-10-25T01:52:15Z) - Group Diffusion Transformers are Unsupervised Multitask Learners [49.288489286276146]
Group Diffusion Transformers (GDTs) are a novel framework that unifies diverse visual generation tasks.
GDTs build upon diffusion transformers with minimal architectural modifications by concatenating self-attention tokens across images.
We evaluate GDTs on a benchmark featuring over 200 instructions across 30 distinct visual generation tasks.
arXiv Detail & Related papers (2024-10-19T07:53:15Z) - Image Inpainting Models are Effective Tools for Instruction-guided Image Editing [42.63350374074953]
This technical report describes the winning solution of the CVPR2024 GenAI Media Generation Challenge Workshop's Instruction-guided Image Editing track.
We use a four-step process, IIIE (Inpainting-based Instruction-guided Image Editing): editing category classification, main editing object identification, editing mask acquisition, and image inpainting.
Results show that through proper combinations of language models and image inpainting models, our pipeline can reach a high success rate with satisfying visual quality.
arXiv Detail & Related papers (2024-07-18T03:55:33Z) - A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z) - PromptFix: You Prompt and We Fix the Photo [84.69812824355269]
Diffusion models equipped with language models demonstrate excellent controllability in image generation tasks.
The lack of diverse instruction-following data hampers the development of models.
We propose PromptFix, a framework that enables diffusion models to follow human instructions.
arXiv Detail & Related papers (2024-05-27T03:13:28Z) - Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z) - MELO: Enhancing Model Editing with Neuron-Indexed Dynamic LoRA [34.21194537887934]
We propose a plug-in Model Editing method based on neuron-indexed dynamic LoRA (MELO).
Our proposed MELO achieves state-of-the-art editing performance on three sequential editing tasks.
arXiv Detail & Related papers (2023-12-19T02:11:01Z) - Self-correcting LLM-controlled Diffusion Models [83.26605445217334]
We introduce Self-correcting LLM-controlled Diffusion (SLD).
SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image.
Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
arXiv Detail & Related papers (2023-11-27T18:56:37Z)