Seer: Language Instructed Video Prediction with Latent Diffusion Models
- URL: http://arxiv.org/abs/2303.14897v3
- Date: Mon, 29 Jan 2024 03:18:25 GMT
- Title: Seer: Language Instructed Video Prediction with Latent Diffusion Models
- Authors: Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, Yang Gao
- Abstract summary: Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating pretrained text-to-image (T2I) Stable Diffusion models along the temporal axis.
With its adaptable architecture, Seer can generate high-fidelity, coherent, and instruction-aligned video frames.
- Score: 43.708550061909754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imagining the future trajectory is the key for robots to make sound planning
and successfully reach their goals. Therefore, text-conditioned video
prediction (TVP) is an essential task to facilitate general robot policy
learning. To tackle this task and empower robots with the ability to foresee
the future, we propose a sample- and computation-efficient model, named
Seer, by inflating the pretrained text-to-image (T2I) Stable Diffusion
models along the temporal axis. We enhance the U-Net and language conditioning
model by incorporating computation-efficient spatial-temporal attention.
Furthermore, we introduce a novel Frame Sequential Text Decomposer module that
dissects a sentence's global instruction into temporally aligned
sub-instructions, ensuring precise integration into each frame of generation.
Our framework allows us to effectively leverage the extensive prior knowledge
embedded in pretrained T2I models across the frames. With this adaptable
architecture, Seer can generate high-fidelity, coherent, and
instruction-aligned video frames by fine-tuning a few layers on a small amount
of data. Experimental results on the Something-Something V2 (SSv2), BridgeData,
and EpicKitchens-100 datasets demonstrate superior video prediction performance
at a cost of around 480 GPU-hours, versus over 12,480 GPU-hours for CogVideo:
a 31% FVD improvement over the current SOTA model on SSv2 and an 83.7% average
preference in human evaluation.
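The inflation step described above (reusing the frozen spatial attention of the pretrained T2I U-Net on each frame while adding a lightweight temporal attention across frames) can be sketched roughly as follows. This is a minimal illustrative sketch under stated assumptions, not the authors' implementation: the class and argument names are hypothetical, and the real model additionally conditions each frame on the temporally aligned sub-instructions produced by the Frame Sequential Text Decomposer.

```python
# Hypothetical sketch of "inflating" a pretrained 2D attention block along time.
import torch
import torch.nn as nn


class InflatedAttentionBlock(nn.Module):
    """Wraps a frozen spatial attention layer and adds temporal attention."""

    def __init__(self, spatial_attn: nn.MultiheadAttention, dim: int, heads: int = 8):
        super().__init__()
        # Pretrained spatial attention from the T2I U-Net, kept frozen.
        self.spatial_attn = spatial_attn
        for p in self.spatial_attn.parameters():
            p.requires_grad = False
        # Newly added temporal attention: the only trainable part here.
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim), tokens = flattened H*W latent positions.
        b, t, n, d = x.shape
        # Spatial attention within each frame, using frozen pretrained weights.
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, n, d)
        # Temporal attention across frames at each spatial position.
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        xt_norm = self.norm(xt)
        xt_out, _ = self.temporal_attn(xt_norm, xt_norm, xt_norm)
        x = x + xt_out.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    dim = 64
    # Stand-in for a pretrained spatial attention layer.
    pretrained_spatial = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
    block = InflatedAttentionBlock(pretrained_spatial, dim)
    video_latents = torch.randn(2, 8, 16, dim)  # (batch, frames, tokens, dim)
    print(block(video_latents).shape)  # -> torch.Size([2, 8, 16, 64])
```

Restricting gradient updates to the newly added temporal layers is in the spirit of the abstract's claim that the model is adapted by fine-tuning only a few layers on a small amount of data.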
Related papers
- FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis.
We present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them.
We propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which respectively improve the efficiency of fine-tuning diffusion-based T2V models into FrameBridge and the synthesis quality of bridge-based I2V models.
arXiv Detail & Related papers (2024-10-20T12:10:24Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
Its VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
Its DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with specially designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts the pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text.
arXiv Detail & Related papers (2022-10-21T13:03:49Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets and a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z)