GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
- URL: http://arxiv.org/abs/2412.04440v1
- Date: Thu, 05 Dec 2024 18:56:05 GMT
- Title: GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
- Authors: Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu
- Abstract summary: We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation.
The collaborative workflow includes three stages: Design, Generation, and Redesign.
To tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario.
- Score: 20.988801611785522
- Abstract: Text-to-video generation models have shown significant progress in recent years. However, they still struggle to generate complex dynamic scenes from compositional text prompts, such as attribute binding for multiple objects, temporal dynamics associated with different objects, and interactions between objects. Our key motivation is that complex tasks can be decomposed into simpler ones, each handled by a role-specialized MLLM agent, and that multiple agents can collaborate to achieve collective intelligence for complex goals. We propose GenMAC, an iterative, multi-agent framework for compositional text-to-video generation. The collaborative workflow comprises three stages: Design, Generation, and Redesign, with an iterative loop between the Generation and Redesign stages to progressively verify and refine the generated videos. The Redesign stage is the most challenging: it verifies the generated videos, suggests corrections, and redesigns the text prompts, frame-wise layouts, and guidance scales for the next iteration of generation. To avoid the hallucinations of a single MLLM agent, we decompose this stage into four sequentially executed MLLM-based agents: a verification agent, a suggestion agent, a correction agent, and an output structuring agent. Furthermore, to tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism that adaptively selects the proper correction agent from a collection of correction agents, each specialized for one scenario. Extensive experiments demonstrate the effectiveness of GenMAC, which achieves state-of-the-art performance in compositional text-to-video generation.
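To make the workflow concrete, here is a minimal Python sketch of the Design, Generation, and Redesign loop as the abstract describes it. Everything in it (the agent functions, the GenerationPlan fields, and the scenario keys) is an illustrative assumption, not the authors' actual API or implementation.

```python
# Hypothetical sketch of GenMAC's Design -> Generation -> Redesign loop,
# based only on the abstract. All names here are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class GenerationPlan:
    prompt: str                                  # redesigned text prompt
    layouts: list = field(default_factory=list)  # frame-wise layouts
    guidance_scale: float = 7.5                  # guidance scale

def design(user_prompt: str) -> GenerationPlan:
    # Design stage: an MLLM agent would draft the initial plan here.
    return GenerationPlan(prompt=user_prompt)

def generate_video(plan: GenerationPlan) -> dict:
    # Generation stage: stand-in for the underlying T2V model call.
    return {"frames": [], "plan": plan}

# --- Redesign stage: four sequentially executed MLLM-based agents ---

def verification_agent(video, plan):
    # 1. Verify the video against the compositional prompt (stubbed).
    return False, ["object count mismatch"]

def suggestion_agent(findings):
    # 2. Turn verification findings into correction advice (stubbed).
    return [f"fix: {f}" for f in findings]

def route_correction(findings) -> str:
    # Self-routing: pick the correction agent specialized for the
    # failing compositional scenario (stubbed).
    return "attribute_binding"

CORRECTION_AGENTS = {
    # 3. One specialized correction agent per compositional scenario.
    "attribute_binding": lambda advice, plan: plan,
    "temporal_dynamics": lambda advice, plan: plan,
    "object_interaction": lambda advice, plan: plan,
}

def output_structuring_agent(plan: GenerationPlan) -> GenerationPlan:
    # 4. Normalize the corrected plan into the schema Generation expects.
    return plan

def redesign(video, plan):
    ok, findings = verification_agent(video, plan)
    if ok:
        return True, plan
    advice = suggestion_agent(findings)
    scenario = route_correction(findings)
    plan = CORRECTION_AGENTS[scenario](advice, plan)
    plan = output_structuring_agent(plan)
    return False, plan

def genmac(user_prompt: str, max_iters: int = 5):
    plan = design(user_prompt)
    video = None
    for _ in range(max_iters):
        video = generate_video(plan)      # Generation
        done, plan = redesign(video, plan)  # Redesign
        if done:
            break
    return video
```

The design choice the abstract emphasizes is visible in `redesign`: each MLLM call has a narrow, single-purpose scope, and the self-routing step dispatches to a scenario-specialized correction agent rather than asking one agent to handle every failure mode.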
Related papers
- CoMA: Compositional Human Motion Generation with Multi-modal Agents [22.151443524452876]
CoMA is an agent-based solution for complex human motion generation, editing, and comprehension.
Our framework enables generation of both short and long motion sequences with detailed instructions, text-guided motion editing, and self-correction.
arXiv Detail & Related papers (2024-12-10T09:08:41Z) - VideoGen-of-Thought: A Collaborative Framework for Multi-Shot Video Generation [70.61101071902596]
Current generation models excel at generating short clips but still struggle with creating multi-shot, movie-like videos.
We propose VideoGen-of-Thought (VGoT), a collaborative and training-free architecture designed specifically for multi-shot video generation.
Our experiments demonstrate that VGoT surpasses existing video generation methods in producing high-quality, coherent, multi-shot videos.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z) - GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing [60.09562648953926]
GenArtist is a unified image generation and editing system coordinated by a multimodal large language model (MLLM) agent.
We integrate a comprehensive range of existing models into the tool library and utilize the agent for tool selection and execution.
Experiments demonstrate that GenArtist can perform various generation and editing tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-07-08T04:30:53Z) - Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting [49.87694319431288]
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources.
We propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs.
Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting.
arXiv Detail & Related papers (2024-06-28T10:05:58Z) - Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation [72.6168579583414]
CompAgent is a training-free approach for compositional text-to-image generation with a large language model (LLM) agent as its core.
Our approach achieves more than 10% improvement on T2I-CompBench, a comprehensive benchmark for open-world compositional T2I generation.
arXiv Detail & Related papers (2024-01-28T16:18:39Z) - CCA: Collaborative Competitive Agents for Image Editing [55.500493143796405]
This paper presents a novel generative model, Collaborative Competitive Agents (CCA).
It leverages the capabilities of multiple Large Language Model (LLM)-based agents to execute complex tasks.
The paper's main contributions include the introduction of a multi-agent-based generative model with controllable intermediate steps and iterative optimization.
arXiv Detail & Related papers (2024-01-23T11:46:28Z)