Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
- URL: http://arxiv.org/abs/2408.09787v1
- Date: Mon, 19 Aug 2024 08:27:31 GMT
- Title: Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation
- Authors: Yunxin Li, Haoyuan Shi, Baotian Hu, Longyue Wang, Jiashun Zhu, Jinyi Xu, Zhen Zhao, Min Zhang
- Abstract summary: Anim-Director is an autonomous animation-making agent.
It harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools.
The whole process is notably autonomous without manual intervention.
- Score: 36.46957675498949
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director's script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output.
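The three-stage process above lends itself to a compact sketch. The following is a minimal, hypothetical rendering of the agent loop; `call_lmm`, `generate_image`, and `generate_video` are placeholder names for the LMM and the generative tools (the paper does not prescribe these interfaces), and the scene splitting, reference-image reuse, and candidate scoring are assumptions about how the stages could be wired together.
```python
# Minimal sketch of the three-stage Anim-Director loop described in the abstract.
# The LMM and the image/video generators are stubbed: call_lmm, generate_image,
# and generate_video are hypothetical placeholders, not the paper's actual API.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Scene:
    description: str               # context-coherent scene description from the script
    image: Optional[bytes] = None  # generated scene image (stage 2)
    video: Optional[bytes] = None  # generated scene clip (stage 3)

def call_lmm(prompt: str, images: Optional[list] = None) -> str:
    """Placeholder for the large multimodal model (text plus optional image input)."""
    raise NotImplementedError

def generate_image(prompt: str, reference_images: Optional[list] = None) -> bytes:
    """Placeholder for the image generation tool used for settings and scenes."""
    raise NotImplementedError

def generate_video(image: bytes, prompt: str) -> bytes:
    """Placeholder for the image-to-video generation tool."""
    raise NotImplementedError

def anim_director(narrative: str, n_candidates: int = 3) -> list[Scene]:
    # Stage 1: storyline, director's script, and scene descriptions from the LMM.
    storyline = call_lmm(f"Write a coherent storyline for: {narrative}")
    script = call_lmm(
        "Write a director's script with character profiles, interior/exterior "
        f"settings, and scene descriptions for:\n{storyline}"
    )
    scenes = [Scene(description=s) for s in script.split("\n\n") if s.strip()]

    for scene in scenes:
        # Stage 2: scene image via visual-language prompting; images of already
        # generated scenes/settings are passed back in to keep visuals consistent.
        refs = [s.image for s in scenes if s.image is not None]
        image_prompt = call_lmm(f"Turn this scene into an image prompt:\n{scene.description}")
        scene.image = generate_image(image_prompt, reference_images=refs)

        # Stage 3: animate the scene image; the LMM writes the motion prompt,
        # then scores candidate clips and the best one is kept.
        motion_prompt = call_lmm(f"Describe the motion for this scene:\n{scene.description}")
        candidates = [generate_video(scene.image, motion_prompt) for _ in range(n_candidates)]
        scores = [
            float(call_lmm("Rate this clip from 0 to 10 for quality and faithfulness.",
                           images=[clip]))
            for clip in candidates
        ]
        scene.video = candidates[scores.index(max(scores))]
    return scenes
```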
Related papers
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- Compositional 3D-aware Video Generation with LLM Director [27.61057927559143]
We propose a novel paradigm that generates each concept in 3D representation separately and then composes them with priors from Large Language Models and 2D diffusion models.
Our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept.
arXiv Detail & Related papers (2024-08-31T23:07:22Z)
- Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation [4.147294190096431]
We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations.
Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline.
Our generated videos outperform commercial video generation models on five metrics covering video quality and instruction-following performance.
arXiv Detail & Related papers (2024-08-19T23:31:02Z)
- LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation [62.232361821779335]
We introduce a tuning-free attention control framework, encapsulated by the progressive process of prompt-Aware editing, StablE animation geneRation, abbreviated as LASER.
We manipulate the model's spatial features and self-attention mechanisms to maintain animation integrity.
Our meticulous control over spatial features and self-attention ensures structural consistency in the images.
arXiv Detail & Related papers (2024-04-21T07:13:56Z)
- Video-Driven Animation of Neural Head Avatars [3.5229503563299915]
We present a new approach for video-driven animation of high-quality neural 3D head models.
We introduce an LSTM-based animation network capable of translating person-independent expression features into personalized animation parameters.
arXiv Detail & Related papers (2024-03-07T10:13:48Z)
- Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on zero-shot text-to-video generation, with an emphasis on data and cost efficiency.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications to adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path.
arXiv Detail & Related papers (2023-09-25T19:42:16Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters (a minimal sketch of this two-module design appears after this list).
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
- MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images [92.13079696503803]
We present MovieFactory, a framework to generate cinematic-picture (3072×1280), film-style (multi-scene), and multi-modality (with sound) movies.
Our approach empowers users to create captivating movies with smooth transitions using simple text inputs.
arXiv Detail & Related papers (2023-06-12T17:31:23Z)
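For the Animate-A-Story entry above, here is a minimal sketch of its two-module design under stated assumptions: `retrieve_video`, `estimate_depth`, and `structure_guided_t2v` are hypothetical stand-ins for the off-the-shelf retrieval system, a depth estimator, and the controllable generation model; the abstract does not specify these interfaces.
```python
# Hedged sketch of the Animate-A-Story pipeline: Motion Structure Retrieval
# followed by Structure-Guided Text-to-Video Synthesis. All callables below are
# hypothetical placeholders, not the paper's actual components.

def retrieve_video(query: str) -> list:
    """Placeholder: off-the-shelf text-to-video retrieval; returns reference frames."""
    raise NotImplementedError

def estimate_depth(frame):
    """Placeholder: per-frame depth estimation used as the motion structure."""
    raise NotImplementedError

def structure_guided_t2v(prompt: str, motion_structure: list, character_ref=None) -> bytes:
    """Placeholder: controllable generator conditioned on depth, text, and characters."""
    raise NotImplementedError

def animate_a_story(prompt: str, character_ref=None) -> bytes:
    # Module 1: Motion Structure Retrieval: fetch a reference clip and keep only
    # its depth sequence as the motion structure, discarding appearance.
    frames = retrieve_video(prompt)
    motion_structure = [estimate_depth(f) for f in frames]

    # Module 2: Structure-Guided Text-to-Video Synthesis: re-render the retrieved
    # motion with the requested text and an optional consistent character reference.
    return structure_guided_t2v(prompt, motion_structure, character_ref)
```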