Gen-L-Video: Multi-Text to Long Video Generation via Temporal
Co-Denoising
- URL: http://arxiv.org/abs/2305.18264v1
- Date: Mon, 29 May 2023 17:38:18 GMT
- Title: Gen-L-Video: Multi-Text to Long Video Generation via Temporal
Co-Denoising
- Authors: Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, Hongsheng
Li
- Abstract summary: This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos.
We introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models to the generation and editing of long, multi-text-conditioned videos without additional training.
Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models.
- Score: 43.35391175319815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging large-scale image-text datasets and advancements in diffusion
models, text-driven generative models have made remarkable strides in the field
of image generation and editing. This study explores the potential of extending
the text-driven ability to the generation and editing of multi-text conditioned
long videos. Current methodologies for video generation and editing, while
innovative, are often confined to extremely short videos (typically less than
24 frames) and are limited to a single text condition. These constraints
significantly limit their applications given that real-world videos usually
consist of multiple segments, each bearing different semantic information. To
address this challenge, we introduce a novel paradigm dubbed Gen-L-Video,
capable of extending off-the-shelf short video diffusion models for generating
and editing videos comprising hundreds of frames with diverse semantic segments
without introducing additional training, all while preserving content
consistency. We have implemented three mainstream text-driven video generation
and editing methodologies and extended them to accommodate longer videos imbued
with a variety of semantic segments with our proposed paradigm. Our
experimental outcomes reveal that our approach significantly broadens the
generative and editing capabilities of video diffusion models, offering new
possibilities for future research and applications. The code is available at
https://github.com/G-U-N/Gen-L-Video.
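The abstract states the idea only at a high level. As a rough illustration, the sketch below shows how denoising a long latent video as a set of overlapping short clips, each carrying its own text prompt, could be organized; the `denoise_step` callable, `clip_len`, and `stride` are hypothetical names chosen for this sketch and do not reflect the interface of the released code.

```python
# Minimal sketch of temporal co-denoising (an illustrative reading of the
# paradigm, not the authors' implementation). A long latent video is viewed as
# overlapping short clips; each clip is denoised by an off-the-shelf short
# video diffusion model under its own text prompt, and the per-clip
# predictions are averaged back into the full-length latent at every step,
# which keeps the overlapping regions consistent across semantic segments.
from typing import Callable, List

import torch


def co_denoise(
    latents: torch.Tensor,        # (frames, channels, height, width), noise at the first timestep
    prompts: List[str],           # one text condition per clip (multi-text conditioning)
    denoise_step: Callable[[torch.Tensor, int, str], torch.Tensor],  # hypothetical short-video denoiser
    timesteps: List[int],         # diffusion timesteps, ordered from noisy to clean
    clip_len: int = 16,           # number of frames the short video model handles
    stride: int = 8,              # neighboring clips overlap by clip_len - stride frames
) -> torch.Tensor:
    frames = latents.shape[0]
    assert frames >= clip_len, "the sketch assumes at least one full clip"

    starts = list(range(0, frames - clip_len + 1, stride))
    if starts[-1] != frames - clip_len:   # make sure the tail frames are covered
        starts.append(frames - clip_len)

    for t in timesteps:
        acc = torch.zeros_like(latents)
        weight = torch.zeros(frames, 1, 1, 1, dtype=latents.dtype, device=latents.device)
        for i, start in enumerate(starts):
            clip = latents[start:start + clip_len]
            # The short model only ever sees clip_len frames and a single prompt.
            pred = denoise_step(clip, t, prompts[min(i, len(prompts) - 1)])
            acc[start:start + clip_len] += pred
            weight[start:start + clip_len] += 1.0
        latents = acc / weight            # average wherever clips overlap
    return latents
```

Because every frame in an overlap region is averaged over all clips that contain it at every denoising step, neighboring segments stay consistent even though each clip is conditioned on a different text prompt.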
Related papers
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [117.13475564834458]
We propose a new way of computing self-attention, termed Consistent Self-Attention.
To extend our method to long-range video generation, we introduce a novel semantic space temporal motion prediction module.
By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos.
arXiv Detail & Related papers (2024-05-02T16:25:16Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects [61.323597069037056]
Current approaches for personalizing text-to-video generation struggle to handle multiple subjects.
We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects.
arXiv Detail & Related papers (2024-01-18T13:23:51Z)
- MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that produces a video depicting multiple events from multiple individual sentences provided by the user.
Our method does not require a large-scale video dataset, as it uses a pre-trained text-to-video generative model without any fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
- VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning [47.61090084143284]
VideoDreamer can generate temporally consistent text-guided videos that faithfully preserve the visual features of the given multiple subjects.
The video generator is further customized for the given multiple subjects by the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy.
arXiv Detail & Related papers (2023-11-02T04:38:50Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.