Related papers: VideoTetris: Towards Compositional Text-to-Video Generation

VideoTetris: Towards Compositional Text-to-Video Generation

URL: http://arxiv.org/abs/2406.04277v2
Date: Mon, 14 Oct 2024 07:20:01 GMT
Title: VideoTetris: Towards Compositional Text-to-Video Generation
Authors: Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui,
Abstract summary: VideoTetris is a framework that enables compositional T2V generation. We show that VideoTetris achieves impressive qualitative and quantitative results in T2V generation.
Score: 45.395598467837374
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion to precisely follow complex textual semantics by manipulating and composing the attention maps of denoising networks spatially and temporally. Moreover, we propose an enhanced video data preprocessing to enhance the training data regarding motion dynamics and prompt understanding, equipped with a new reference frame attention mechanism to improve the consistency of auto-regressive video generation. Extensive experiments demonstrate that our VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris

Related papers

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization [63.37161241355025]
Video-MSG is a training-free method for T2V generation based on Multimodal planning and Structured noise initialization. It guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models.
arXiv Detail & Related papers (2025-04-11T15:41:43Z)
Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance [2.5941932242768457]
Mask-guided video generation can control video generation through mask motion sequences. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. This approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality.
arXiv Detail & Related papers (2025-03-24T06:53:08Z)
HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models [63.65066762436074]
HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks.
arXiv Detail & Related papers (2025-03-14T15:36:39Z)
FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [3.6519202494141125]
We introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism.<n>CTGM incorporates the Temporal Information (TII) and Temporal Affinity Refiner (TAR) at the beginning and end of cross-attention.<n>Our approach achieves state-of-the-art T2V generation results on the EvalCrafter benchmark.
arXiv Detail & Related papers (2024-08-15T14:47:44Z)
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video. In this paper, we address such limitations in video pre-training with an efficient video decomposition. Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics together, leading to an increased complexity of text-to-video generation (T2V) We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation [37.25815760042241]
This paper introduces a new framework, dubbed DirecT2V, to generate text-to-video (T2V) videos. We equip a diffusion model with a novel value mapping method and dual-softmax filtering, which do not require any additional training. The experimental results validate the effectiveness of our framework in producing visually coherent and storyful videos.
arXiv Detail & Related papers (2023-05-23T17:57:09Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V) We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.