DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot
  Text-to-Video Generation
- URL: http://arxiv.org/abs/2305.14330v3
- Date: Tue, 6 Feb 2024 18:44:30 GMT
- Title: DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot
  Text-to-Video Generation
- Authors: Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong
  Kim
- Abstract summary: This paper introduces DirecT2V, a new framework for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
- Score: 37.25815760042241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   In the paradigm of AI-generated content (AIGC), there has been increasing
attention to transferring knowledge from pre-trained text-to-image (T2I) models
to text-to-video (T2V) generation. Despite their effectiveness, these
frameworks face challenges in maintaining consistent narratives and handling
shifts in scene composition or object placement from a single abstract user
prompt. Exploring the ability of large language models (LLMs) to generate
time-dependent, frame-by-frame prompts, this paper introduces a new framework,
dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors,
enabling the inclusion of time-varying content and facilitating consistent
video generation. To maintain temporal consistency and prevent mapping the
value to a different object, we equip a diffusion model with a novel value
mapping method and dual-softmax filtering, which do not require any additional
training. The experimental results validate the effectiveness of our framework
in producing visually coherent and storyful videos from abstract user prompts,
successfully addressing the challenges of zero-shot video generation.
 
      
Related papers
        - SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
 We propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework.
We establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement.
 (2025-04-17T16:37:27Z)
- Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
 We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs.
CTRM comprises two key components: the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL).
We design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets.
 (2024-12-14T07:28:38Z)
- VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
 VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
 (2024-06-06T17:25:33Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
 We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
 (2024-03-18T17:59:58Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and
  Decoupling [79.49128866877922]
 Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
 (2023-10-08T03:35:27Z)
- Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM
  Animator [59.589919015669274]
 This study focuses on zero-shot text-to-video generation with data and cost efficiency in mind.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantic-coherence prompt sequence.
We also propose a series of annotative modifications for adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation.
 (2023-09-25T19:42:16Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
 We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
 (2023-03-28T22:45:07Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
 Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
 (2022-09-29T13:59:46Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
 Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully-supervised setting that requires a large amount of manual annotations or in a weakly supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
 (2021-09-23T16:29:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     