Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
- URL: http://arxiv.org/abs/2305.08850v2
- Date: Mon, 19 Feb 2024 02:42:27 GMT
- Title: Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts
- Authors: Yuyang Zhao, Enze Xie, Lanqing Hong, Zhenguo Li, Gim Hee Lee
- Abstract summary: We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse source video, target visual and textual clues, and propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
- Score: 116.05656635044357
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The text-driven image and video diffusion models have achieved unprecedented
success in generating realistic and diverse content. Recently, the editing and
variation of existing images and videos in diffusion-based generative models
have garnered significant attention. However, previous works are limited to
editing content with text or providing coarse personalization using a single
visual clue, rendering them unsuitable for indescribable content that requires
fine-grained and detailed control. In this regard, we propose a generic video
editing framework called Make-A-Protagonist, which utilizes textual and visual
clues to edit videos with the goal of empowering individuals to become the
protagonists. Specifically, we leverage multiple experts to parse source video,
target visual and textual clues, and propose a visual-textual-based video
generation model that employs mask-guided denoising sampling to generate the
desired output. Extensive results demonstrate the versatile and remarkable
editing capabilities of Make-A-Protagonist.
Related papers
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users.
Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z) - RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives [58.15403987979496]
This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework.
Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content.
The proposed framework demonstrates impressive versatile capabilities in video-to-paragraph generation, video content editing, and can be incorporated into other SoTA video generative models for further enhancement.
arXiv Detail & Related papers (2024-05-28T17:46:36Z) - Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D image text-to-image (T2I) diffusion model.
Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z) - CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z) - VideoDreamer: Customized Multi-Subject Text-to-Video Generation with
Disen-Mix Finetuning [47.61090084143284]
VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy.
arXiv Detail & Related papers (2023-11-02T04:38:50Z) - InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot
Text-based Video Editing [27.661609140918916]
InFusion is a framework for zero-shot text-based video editing.
It supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training.
arXiv Detail & Related papers (2023-07-22T17:05:47Z) - Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Gen-L-Video: Multi-Text to Long Video Generation via Temporal
Co-Denoising [43.35391175319815]
This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos.
We introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models.
Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models.
arXiv Detail & Related papers (2023-05-29T17:38:18Z) - Structure and Content-Guided Video Synthesis with Diffusion Models [13.464501385061032]
We present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output.
Our model is trained jointly on images and videos which also exposes explicit control of temporal consistency through a novel guidance method.
arXiv Detail & Related papers (2023-02-06T18:50:23Z) - Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.