Phenaki: Variable Length Video Generation From Open Domain Textual
Description
- URL: http://arxiv.org/abs/2210.02399v1
- Date: Wed, 5 Oct 2022 17:18:28 GMT
- Title: Phenaki: Variable Length Video Generation From Open Domain Textual
Description
- Authors: Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan
Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze,
Dumitru Erhan
- Abstract summary: Phenaki is a model capable of realistic video synthesis given a sequence of textual prompts.
A new model for learning video representations compresses the video into a small set of discrete tokens.
To the best of our knowledge, this is the first paper to study generating videos from time-variable prompts.
- Score: 21.610541668826006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Phenaki, a model capable of realistic video synthesis given a
sequence of textual prompts. Generating videos from text is particularly
challenging due to the computational cost, the limited quantity of high-quality
text-video data, and the variable length of videos. To address these issues, we
introduce a new model for learning video representations that compresses the
video into a small set of discrete tokens. This tokenizer uses causal attention
in time, which allows it to work with variable-length videos. To generate video
tokens from text, we use a bidirectional masked transformer conditioned on
pre-computed text tokens. The generated video tokens are subsequently
de-tokenized to create the actual video. To address the data issues, we
demonstrate how joint training on a large corpus of image-text pairs as well
as a smaller number of video-text examples can result in generalization beyond
what is available in the video datasets. Compared to previous video generation
methods, Phenaki can generate arbitrarily long videos conditioned on a sequence
of prompts (i.e., time-variable text, or a story) in an open domain. To the best
of our knowledge, this is the first paper to study generating videos from
time-variable prompts. In addition, compared to per-frame baselines, the
proposed video encoder-decoder computes fewer tokens per video but results in
better spatio-temporal consistency.
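
The abstract describes a two-stage pipeline: a video tokenizer with causal attention in time that turns frames into discrete tokens, and a bidirectional masked transformer that fills in video tokens conditioned on text tokens, after which the tokens are de-tokenized back into frames. The Python sketch below illustrates this idea only schematically; the module sizes, the causal_time_mask helper, the predict_logits callable, and the MaskGIT-style cosine unmasking schedule are illustrative assumptions, not Phenaki's actual implementation.

# Schematic sketch (not the paper's code): a causal video tokenizer plus
# MaskGIT-style iterative sampling of video tokens conditioned on text tokens.
# All names, shapes, and hyperparameters below are illustrative assumptions.
import math
import torch
import torch.nn as nn


def causal_time_mask(n_tokens, tokens_per_frame):
    # Tokens may attend within their own frame and to earlier frames only.
    frame = torch.arange(n_tokens) // tokens_per_frame
    return frame[None, :] > frame[:, None]               # True = attention blocked


class CausalVideoTokenizer(nn.Module):
    # Toy stand-in for the tokenizer: patch embeddings, a transformer with a
    # causal mask over time, and nearest-codebook quantization to token ids.
    def __init__(self, codebook_size=1024, dim=256, tokens_per_frame=64):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame
        self.patch_embed = nn.Linear(3 * 16 * 16, dim)    # assumes 16x16 RGB patches
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.codebook = nn.Embedding(codebook_size, dim)  # VQ codebook

    def encode(self, patches):                            # patches: (B, T*P, 768)
        mask = causal_time_mask(patches.shape[1], self.tokens_per_frame)
        x = self.temporal(self.patch_embed(patches), mask=mask)
        dists = (x.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(dim=-1)                       # discrete token ids


@torch.no_grad()
def sample_video_tokens(predict_logits, text_tokens, n_video_tokens, steps=12):
    # Iteratively reveal the most confident masked positions, keeping a
    # cosine-shaped fraction masked until the last step (a generic MaskGIT-style
    # loop; predict_logits is any bidirectional transformer conditioned on text).
    tokens = torch.zeros(1, n_video_tokens, dtype=torch.long)
    masked = torch.ones(1, n_video_tokens, dtype=torch.bool)
    for step in range(steps):
        logits = predict_logits(tokens, masked, text_tokens)   # (1, N, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        conf = conf.masked_fill(~masked, -1.0)                 # only fill masked slots
        keep = int(n_video_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        n_reveal = int(masked.sum()) - keep
        if n_reveal <= 0:
            continue
        idx = conf[0].topk(n_reveal).indices
        tokens[0, idx] = pred[0, idx]
        masked[0, idx] = False
    return tokens                                              # de-tokenize to frames afterwards

In this reading, the causal mask over frames is what lets the same tokenizer handle videos of different lengths, and the confidence-based unmasking loop replaces frame-by-frame autoregressive decoding; training the tokenizer with a reconstruction loss and the final de-tokenization step are omitted from the sketch.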
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [72.59262815400928]
Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation.
We come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos.
arXiv Detail & Related papers (2023-12-25T16:37:39Z)
- Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens [70.80127538938093]
Vista-LLaMA is a novel framework that maintains a consistent distance between all visual tokens and any language tokens.
We present a sequential visual projector that projects the current video frame into tokens of language space with the assistance of the previous frame.
arXiv Detail & Related papers (2023-12-12T09:47:59Z)
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding, where accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z) - Show Me What and Tell Me How: Video Synthesis via Multimodal
Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.