DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot
Text-to-Video Generation
- URL: http://arxiv.org/abs/2305.14330v3
- Date: Tue, 6 Feb 2024 18:44:30 GMT
- Title: DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot
Text-to-Video Generation
- Authors: Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong
Kim
- Abstract summary: This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
- Score: 37.25815760042241
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the paradigm of AI-generated content (AIGC), there has been increasing
attention to transferring knowledge from pre-trained text-to-image (T2I) models
to text-to-video (T2V) generation. Despite their effectiveness, these
frameworks face challenges in maintaining consistent narratives and handling
shifts in scene composition or object placement from a single abstract user
prompt. Exploring the ability of large language models (LLMs) to generate
time-dependent, frame-by-frame prompts, this paper introduces a new framework,
dubbed DirecT2V. DirecT2V leverages instruction-tuned LLMs as directors,
enabling the inclusion of time-varying content and facilitating consistent
video generation. To maintain temporal consistency and prevent mapping the
value to a different object, we equip a diffusion model with a novel value
mapping method and dual-softmax filtering, which do not require any additional
training. The experimental results validate the effectiveness of our framework
in producing visually coherent and storyful videos from abstract user prompts,
successfully addressing the challenges of zero-shot video generation.
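The abstract names the key components (an instruction-tuned LLM acting as a frame-level director, a value mapping method, and dual-softmax filtering) but gives no implementation details. Below is a minimal Python sketch of the general idea, not the authors' code: the director instruction, the `llm` callable, the similarity-matrix shapes, and the temperature/threshold values are all illustrative assumptions.
```python
# Hedged sketch of a DirecT2V-style pipeline; interfaces and constants are hypothetical.
import numpy as np
from typing import Callable, List


def frame_prompts(user_prompt: str, num_frames: int,
                  llm: Callable[[str], str]) -> List[str]:
    """Ask an instruction-tuned LLM to act as a frame-level director:
    expand one abstract user prompt into time-ordered per-frame prompts."""
    instruction = (
        f"You are a video director. Expand the story '{user_prompt}' into "
        f"{num_frames} frame descriptions, one per line, in temporal order."
    )
    lines = [ln.strip() for ln in llm(instruction).splitlines() if ln.strip()]
    return lines[:num_frames]


def dual_softmax_filter(sim: np.ndarray, tau: float = 0.1,
                        thresh: float = 0.2) -> np.ndarray:
    """Generic dual-softmax mutual-matching filter (a stand-in for the paper's
    filtering step): softmax over rows and columns of a similarity matrix,
    multiplied elementwise, then thresholded to keep reliable correspondences."""
    def softmax(x, axis):
        x = x / tau
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    conf = softmax(sim, axis=0) * softmax(sim, axis=1)  # mutual confidence
    return conf > thresh  # boolean mask of positions allowed to share values


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mask = dual_softmax_filter(rng.standard_normal((8, 8)))
    print(mask.sum(), "reliable matches out of", mask.size)
```
In this reading, the LLM expands one abstract prompt into per-frame prompts that condition a frozen T2I diffusion model, while the dual-softmax mask decides which positions may reuse values from an anchor frame so that objects stay temporally consistent.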
Related papers
- VideoTetris: Towards Compositional Text-to-Video Generation [45.395598467837374]
VideoTetris is a framework that enables compositional T2V generation.
We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
arXiv Detail & Related papers (2024-06-06T17:25:33Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with specially designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator [59.589919015669274]
This study focuses on data- and cost-efficient zero-shot text-to-video generation.
We propose a novel Free-Bloom pipeline that harnesses large language models (LLMs) as the director to generate a semantically coherent prompt sequence.
We also propose a series of annotative modifications for adapting LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and a dual-path scheme.
arXiv Detail & Related papers (2023-09-25T19:42:16Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning of region-object alignment and temporally aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully supervised setting that requires large amounts of manual annotation, or in a weakly supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)