TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
- URL: http://arxiv.org/abs/2405.04682v3
- Date: Sat, 25 May 2024 01:13:26 GMT
- Title: TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation
- Authors: Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang
- Abstract summary: We introduce the Time-Aligned Captions (TALC) framework to generate multi-scene videos.
Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions.
Our TALC-finetuned model outperforms the baseline methods on multi-scene video-text data by 15.5 points on aggregated score.
- Score: 72.25642183446102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion-based generative modeling have led to the development of text-to-video (T2V) models that can generate high-quality videos conditioned on a text prompt. Most of these T2V models produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from a pretrained T2V model, we introduce the Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. As a result, we show that the pretrained T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and be visually consistent (e.g., w.r.t. entity and background). Our TALC-finetuned model outperforms the baseline methods on multi-scene video-text data by 15.5 points on aggregated score, averaging visual consistency and text adherence using human evaluation. The project website is https://talc-mst2v.github.io/.
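The abstract describes TALC's core change at a high level: each scene's frames are conditioned on that scene's caption, rather than on one pooled prompt for the whole clip. As a rough illustration only (not the authors' implementation; see the project website for the actual method), the sketch below shows how per-scene caption embeddings could be broadcast to their corresponding frames before the cross-attention layers of a diffusion T2V model. The function name, tensor shapes, and frame counts are illustrative assumptions.

```python
# Minimal sketch of time-aligned text conditioning (illustrative, not TALC's code).
# Each scene's caption embedding is repeated for the frames of that scene, so
# frame i cross-attends to its own scene description instead of a shared prompt.
import torch

def build_time_aligned_context(caption_embs: torch.Tensor, frames_per_scene: list) -> torch.Tensor:
    """caption_embs: (num_scenes, num_tokens, dim) per-scene text embeddings.
    frames_per_scene: number of generated frames assigned to each scene.
    Returns a (total_frames, num_tokens, dim) frame-aligned text context."""
    chunks = [
        emb.unsqueeze(0).expand(n, -1, -1)   # (n_frames, num_tokens, dim)
        for emb, n in zip(caption_embs, frames_per_scene)
    ]
    return torch.cat(chunks, dim=0)

# Toy usage: two scenes ('a red panda climbing a tree', 'the red panda sleeps
# on the top of the tree'), 8 frames each, with CLIP-like 77x768 embeddings.
caption_embs = torch.randn(2, 77, 768)
context = build_time_aligned_context(caption_embs, [8, 8])
print(context.shape)  # torch.Size([16, 77, 768])
```

In a standard T2V diffusion backbone, the cross-attention for frame i would then read context[i], which is the scene-level temporal alignment the abstract refers to.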
Related papers
- Frame-Level Captions for Long Video Generation with Complex Multi Scenes [52.12699618126831]
We propose a novel way to annotate datasets at the frame level.
This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely.
Our training uses Diffusion Forcing to give the model the ability to handle time flexibly.
arXiv Detail & Related papers (2025-05-27T07:39:43Z) - VAST 1.0: A Unified Framework for Controllable and Consistent Video Generation [48.318567065609216]
VAST (Video As Storyboard from Text) is a framework to generate high-quality videos from textual descriptions.
By decoupling text understanding from video generation, VAST enables precise control over subject dynamics and scene composition.
Experiments on the VBench benchmark demonstrate that VAST outperforms existing methods in both visual quality and semantic expression.
arXiv Detail & Related papers (2024-12-21T15:59:07Z) - ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models [13.04745908368858]
We introduce ViBe: a large-scale Text-to-Video Benchmark of hallucinated videos from T2V models.
Using 10 open-source T2V models, we developed the first large-scale dataset of hallucinated videos.
This benchmark aims to drive the development of robust T2V models that produce videos more accurately aligned with input prompts.
arXiv Detail & Related papers (2024-11-16T19:23:12Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", with purpose-designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z) - TaleCrafter: Interactive Story Visualization with Multiple Characters [49.14122401339003]
This paper proposes a system for generic interactive story visualization.
It is capable of handling multiple novel characters and supporting the editing of layout and local structure.
The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V).
arXiv Detail & Related papers (2023-05-29T17:11:39Z) - DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot
Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, for zero-shot text-to-video (T2V) generation.
We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training.
The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z) - Tune-A-Video: One-Shot Tuning of Image Diffusion Models for
Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive video datasets for training.
We propose Tune-A-Video, which is capable of producing temporally-coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z) - Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)