Seer: Language Instructed Video Prediction with Latent Diffusion Models
- URL: http://arxiv.org/abs/2303.14897v3
- Date: Mon, 29 Jan 2024 03:18:25 GMT
- Title: Seer: Language Instructed Video Prediction with Latent Diffusion Models
- Authors: Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, Yang Gao
- Abstract summary: Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) Stable Diffusion model along the temporal axis.
With its adaptably designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames.
- Score: 43.708550061909754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imagining the future trajectory is the key for robots to make sound planning
and successfully reach their goals. Therefore, text-conditioned video
prediction (TVP) is an essential task to facilitate general robot policy
learning. To tackle this task and empower robots with the ability to foresee
the future, we propose a sample- and computation-efficient model, named
\textbf{Seer}, by inflating the pretrained text-to-image (T2I) Stable Diffusion
models along the temporal axis. We enhance the U-Net and language conditioning
model by incorporating computation-efficient spatial-temporal attention.
Furthermore, we introduce a novel Frame Sequential Text Decomposer module that
dissects a sentence's global instruction into temporally aligned
sub-instructions, ensuring precise integration into each frame of generation.
Our framework allows us to effectively leverage the extensive prior knowledge
embedded in pretrained T2I models across the frames. With the
adaptably designed architecture, Seer makes it possible to generate
high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a
few layers on a small amount of data. Experimental results on the Something
Something V2 (SSv2), BridgeData, and EpicKitchens-100 datasets demonstrate our
superior video prediction performance at around 480 GPU-hours, versus over
12,480 GPU-hours for CogVideo: a 31% FVD improvement over the current SOTA
model on SSv2 and an 83.7% average preference in the human evaluation.
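The decomposed spatial-temporal attention described above can be illustrated with a minimal NumPy toy. This is a sketch of the general inflation idea (spatial attention within each frame, then temporal attention across frames at each spatial location), not Seer's actual implementation; the shapes, function names, and single-head formulation are assumptions for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention, batched over all leading axes.
    d = q.shape[-1]
    w = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d))
    return w @ v

def spatial_temporal_attention(x):
    """x: (T, N, D) — T frames, N spatial tokens per frame, D channels.
    Full attention over all T*N tokens would cost O((T*N)^2); decomposing
    into a spatial pass then a temporal pass costs O(T*N^2 + N*T^2)."""
    # Spatial pass: each frame attends within itself (this is the part
    # that can reuse pretrained T2I attention weights).
    x = attend(x, x, x)
    # Temporal pass: each spatial location attends across frames.
    xt = np.swapaxes(x, 0, 1)        # (N, T, D)
    xt = attend(xt, xt, xt)
    return np.swapaxes(xt, 0, 1)     # back to (T, N, D)

x = np.random.randn(8, 16, 32)       # 8 frames, 16 patches, 32 channels
y = spatial_temporal_attention(x)
print(y.shape)                       # (8, 16, 32)
```

In a real inflated U-Net the spatial attention layers keep their pretrained T2I weights and only the newly added temporal layers (plus a few others) are fine-tuned, which is what makes the approach sample- and computation-efficient.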
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z) - LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text.
arXiv Detail & Related papers (2022-10-21T13:03:49Z) - Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z) - Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.