A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- URL: http://arxiv.org/abs/2312.15770v1
- Date: Mon, 25 Dec 2023 16:37:39 GMT
- Title: A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
- Authors: Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya
Zhang, Yujun Shen, Changxin Gao, Nong Sang
- Abstract summary: Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation.
We come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos.
- Score: 72.59262815400928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based text-to-video generation has witnessed impressive progress in
the past year yet still falls behind text-to-image generation. One of the key
reasons is the limited scale of publicly available data (e.g., 10M video-text
pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost
of video captioning. Instead, it could be far easier to collect unlabeled clips
from video platforms like YouTube. Motivated by this, we come up with a novel
text-to-video generation framework, termed TF-T2V, which can directly learn
with text-free videos. The rationale behind this design is to separate the
process of text decoding from that of temporal modeling. To this end, we employ
a content branch and a motion branch, which are jointly optimized with shared weights.
Following such a pipeline, we study the effect of doubling the scale of the
training set (i.e., video-only WebVid10M) with some randomly collected
text-free videos and observe an encouraging performance improvement (FID
from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of
our approach. We also find that our model enjoys a further performance
gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some
text labels for training. Finally, we validate the effectiveness and
generalizability of our approach on both native text-to-video generation and
compositional video synthesis paradigms. Code and models will be publicly
available at https://tf-t2v.github.io/.
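The two-branch idea above can be made concrete with a small sketch. Everything below (module names, the simple additive text conditioning, the unweighted loss sum) is a hypothetical reconstruction from the abstract alone, not the authors' released code.
```python
# Minimal sketch of TF-T2V-style joint training, reconstructed from the
# abstract: paired data supervises the text-conditioned content branch,
# text-free clips supervise the motion branch, and both update one backbone.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Stand-in for a video diffusion UNet shared by both branches."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, noisy_latent, text_emb=None):
        h = noisy_latent
        if text_emb is not None:          # content branch: text conditioning
            h = h + text_emb
        return self.net(h)                # predicts the added noise

def joint_step(backbone, paired_batch, textfree_batch):
    # Content branch: video-text (or image-text) pairs teach text decoding.
    latents, text_emb = paired_batch
    noise = torch.randn_like(latents)
    pred = backbone(latents + noise, text_emb)
    content_loss = nn.functional.mse_loss(pred, noise)

    # Motion branch: text-free clips teach temporal dynamics, no captions needed.
    video_latents = textfree_batch
    noise_v = torch.randn_like(video_latents)
    pred_v = backbone(video_latents + noise_v, text_emb=None)
    motion_loss = nn.functional.mse_loss(pred_v, noise_v)

    # "Shared weights": both losses backpropagate into the same backbone.
    return content_loss + motion_loss

backbone = SharedBackbone()
paired = (torch.randn(4, 64), torch.randn(4, 64))   # toy latents + text embeddings
textfree = torch.randn(4, 64)                        # toy text-free video latents
loss = joint_step(backbone, paired, textfree)
loss.backward()
```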
Related papers
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273]
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD), along with a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models (a minimal sketch of such a counterfactual contrastive objective appears after this list).
arXiv Detail & Related papers (2024-07-18T01:55:48Z)
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of roughly 145 words, over 10x longer than the captions in most video-text datasets.
Alongside the corpus, the authors train Vriptor, a video captioning model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FMs) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric dialogue.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters (a toy sketch of such a combined objective appears after this list).
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its spatiotemporal dynamics.
In this paper, we address this limitation in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that generates a video depicting multiple events, given multiple individual sentences from the user.
Our method does not require a large-scale video dataset, since it uses a pre-trained text-to-video generative model without any fine-tuning.
Our proposed method is superior to other video-generative models in terms of the temporal coherence of content and semantics (a hypothetical sketch of such sequential multi-event generation appears after this list).
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence (a minimal frame-selection sketch appears after this list).
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Text-Driven Video Acceleration: A Weakly-Supervised Reinforcement Learning Method [6.172652648945223]
This paper presents a novel weakly-supervised methodology to accelerate instructional videos using text.
A novel joint reward function guides our agent to select which frames to remove, reducing the input video to a target length (a hypothetical reward sketch appears after this list).
We also propose the Extended Visually-guided Document Attention Network (VDAN+), which can generate a highly discriminative embedding space.
arXiv Detail & Related papers (2022-03-29T17:43:01Z)
- X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval [26.581384985173116]
In text-video retrieval, the objective is to learn a cross-modal similarity function between a text and a video.
We propose a cross-modal attention model called X-Pool that reasons between a text and the frames of a video (a simplified attention-pooling sketch appears after this list).
arXiv Detail & Related papers (2022-03-28T20:47:37Z)
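For the RCAD/Feint6K entry above, the following is a minimal, hypothetical sketch of a contrastive objective that uses counterfactually edited captions as hard negatives; the actual training recipe in that paper may differ.
```python
# Hypothetical sketch: contrast a video against its true caption and against
# counterfactually edited (wrong-action) versions of that caption.
import torch
import torch.nn.functional as F

def counterfactual_contrastive(video_emb, pos_text_emb, cf_text_emb, tau=0.07):
    """video_emb: (B, D); pos_text_emb: (B, D) matching captions;
    cf_text_emb: (B, K, D) counterfactually edited captions per video."""
    v = F.normalize(video_emb, dim=-1)
    p = F.normalize(pos_text_emb, dim=-1)
    n = F.normalize(cf_text_emb, dim=-1)

    pos_sim = (v * p).sum(-1, keepdim=True) / tau       # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", v, n) / tau    # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1)       # positive at index 0
    labels = torch.zeros(len(v), dtype=torch.long)
    # Cross-entropy pushes the true caption above its counterfactual edits,
    # forcing the model to distinguish actions, not just objects and scenes.
    return F.cross_entropy(logits, labels)

loss = counterfactual_contrastive(torch.randn(8, 256),
                                  torch.randn(8, 256),
                                  torch.randn(8, 5, 256))
```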
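For the InternVideo2 entry, here is a toy combination of the three objective families it names (masked video modeling, cross-modal contrastive learning, next-token prediction); the heads, shapes, and loss weights below are hypothetical placeholders, not the paper's configuration.
```python
# Toy combined objective: masked reconstruction + InfoNCE alignment + LM loss.
import torch
import torch.nn.functional as F

def combined_loss(pred_patches, target_patches, mask,
                  video_emb, text_emb,
                  token_logits, next_tokens,
                  w=(1.0, 1.0, 1.0), tau=0.07):
    # 1) Masked video modeling: reconstruct only the masked patches.
    mvm = F.mse_loss(pred_patches[mask], target_patches[mask])

    # 2) Cross-modal contrastive: align clip and caption embeddings (InfoNCE).
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau
    labels = torch.arange(len(v))
    con = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

    # 3) Next-token prediction over caption tokens.
    ntp = F.cross_entropy(token_logits.flatten(0, 1), next_tokens.flatten())

    return w[0] * mvm + w[1] * con + w[2] * ntp

B, N, P, V, D = 4, 16, 32, 8, 128  # batch, patches, patch dim, vocab, emb dim
loss = combined_loss(
    torch.randn(B, N, P), torch.randn(B, N, P), torch.rand(B, N) > 0.5,
    torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, 7, V), torch.randint(V, (B, 7)))
```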
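For the MEVG entry, here is a hypothetical sketch of multi-event generation with a frozen text-to-video model: one clip per sentence, each clip's initial latent blended with the previous clip's final latent for coherence. The `t2v_model` interface is an assumption, not MEVG's actual mechanism.
```python
# Hypothetical sequential multi-event generation with a frozen T2V model.
import torch

def generate_multi_event(t2v_model, sentences, num_frames=16, latent_dim=64):
    clips = []
    prev_last = None
    for sent in sentences:
        init = torch.randn(num_frames, latent_dim)
        if prev_last is not None:
            # Blend the previous clip's final latent into this clip's start
            # so consecutive events share appearance and layout.
            init[0] = 0.5 * init[0] + 0.5 * prev_last
        clip = t2v_model(prompt=sent, init_latents=init)  # frozen, no fine-tuning
        clips.append(clip)
        prev_last = clip[-1]
    return torch.cat(clips, dim=0)  # one continuous multi-event video

# Identity stand-in for a real frozen T2V model, just to exercise the loop.
demo = generate_multi_event(lambda prompt, init_latents: init_latents,
                            ["a dog runs in a park", "the dog lies down"])
```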
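For the FineCo entry, here is a minimal frame-selection sketch: keep the frames whose embeddings best match the caption. The scoring rule and `k` are simplifications of the paper's fine-grained objective.
```python
# Minimal sketch: distil a video down to its most text-relevant frames.
import torch
import torch.nn.functional as F

def select_frames(frame_embs, text_emb, k=4):
    """frame_embs: (T, D) per-frame embeddings; text_emb: (D,) caption embedding."""
    sims = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1)  # (T,)
    topk = sims.topk(k).indices
    # Distilled video representation: mean of the k best-matching frames.
    return frame_embs[topk].mean(dim=0), topk

video_repr, kept = select_frames(torch.randn(32, 256), torch.randn(256))
```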
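For the text-driven video acceleration entry, here is one plausible (entirely hypothetical) form of a joint reward: semantic agreement between the kept frames and the instruction text, minus a penalty for missing the target length. The paper's actual reward may be quite different.
```python
# Hypothetical joint reward for text-guided frame dropping.
import torch
import torch.nn.functional as F

def joint_reward(kept_frame_embs, doc_emb, target_len, alpha=1.0, beta=0.1):
    # Semantic term: kept frames should stay close to the document embedding.
    sem = F.cosine_similarity(kept_frame_embs, doc_emb.unsqueeze(0), dim=-1).mean()
    # Length term: quadratic penalty for deviating from the target length.
    len_pen = (len(kept_frame_embs) - target_len) ** 2 / target_len ** 2
    return alpha * sem - beta * len_pen

r = joint_reward(torch.randn(20, 128), torch.randn(128), target_len=24)
```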
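For the X-Pool entry, here is a simplified single-head sketch of text-conditioned pooling: the text embedding acts as an attention query over frame embeddings, so the pooled video representation is tailored to the query text. The paper's learned projections and multi-head attention are omitted.
```python
# Simplified text-conditioned attention pooling over video frames.
import math
import torch

def text_conditioned_pool(text_emb, frame_embs):
    """text_emb: (D,); frame_embs: (T, D). Returns a (D,) pooled video embedding."""
    d = text_emb.shape[-1]
    attn = (frame_embs @ text_emb) / math.sqrt(d)   # (T,) text-frame relevance
    weights = torch.softmax(attn, dim=0)            # emphasize text-relevant frames
    return weights @ frame_embs                     # text-aware weighted average

pooled = text_conditioned_pool(torch.randn(256), torch.randn(16, 256))
# For retrieval, compare `pooled` against the text embedding (e.g., cosine similarity).
```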