Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive
  Transformer
- URL: http://arxiv.org/abs/2204.03638v1
- Date: Thu, 7 Apr 2022 17:59:02 GMT
- Title: Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive
  Transformer
- Authors: Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs,
  Jia-Bin Huang, Devi Parikh
- Abstract summary: We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
- Score: 66.56167074658697
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Videos are created to express emotion, exchange information, and share
experiences. Video synthesis has intrigued researchers for a long time. Despite
the rapid progress driven by advances in visual synthesis, most existing
studies focus on improving the frames' quality and the transitions between
them, while little progress has been made in generating longer videos. In this
paper, we present a method that builds on 3D-VQGAN and transformers to generate
videos with thousands of frames. Our evaluation shows that our model trained on
16-frame video clips from standard benchmarks such as UCF-101, Sky Time-lapse,
and Taichi-HD datasets can generate diverse, coherent, and high-quality long
videos. We also showcase conditional extensions of our approach for generating
meaningful long videos by incorporating temporal information with text and
audio. Videos and code can be found at
https://songweige.github.io/projects/tats/index.html.
 
      
Related papers
        - Multimodal Long Video Modeling Based on Temporal Dynamic Context [13.979661295432964]
 We propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC).
We segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders.
To handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments.
 arXiv  Detail & Related papers  (2025-04-14T17:34:06Z)
- VideoAuteur: Towards Long Narrative Video Generation [22.915448471769384]
 We present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain.
We introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos.
Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes.
 arXiv  Detail & Related papers  (2025-01-10T18:52:11Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video
  Synthesis [69.83405335645305]
 We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead.
This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
 arXiv  Detail & Related papers  (2024-02-22T18:55:08Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
 Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
 arXiv  Detail & Related papers  (2024-02-05T16:30:49Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
  Prediction [93.26613503521664]
 This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
 arXiv  Detail & Related papers  (2023-10-31T17:58:17Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video
  Generators [70.17041424896507]
 Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
 arXiv  Detail & Related papers  (2023-03-23T17:01:59Z)
- Towards Smooth Video Composition [59.134911550142455]
 Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
 arXiv  Detail & Related papers  (2022-12-14T18:54:13Z)