Seer: Language Instructed Video Prediction with Latent Diffusion Models
- URL: http://arxiv.org/abs/2303.14897v3
- Date: Mon, 29 Jan 2024 03:18:25 GMT
- Title: Seer: Language Instructed Video Prediction with Latent Diffusion Models
- Authors: Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, Yang Gao
- Abstract summary: Text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning.
We propose a sample- and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) Stable Diffusion model along the temporal axis.
With its adaptably designed architecture, Seer makes it possible to generate high-fidelity, coherent, and instruction-aligned video frames.
- Score: 43.708550061909754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imagining the future trajectory is the key for robots to make sound planning
and successfully reach their goals. Therefore, text-conditioned video
prediction (TVP) is an essential task to facilitate general robot policy
learning. To tackle this task and empower robots with the ability to foresee
the future, we propose a sample- and computation-efficient model, named
\textbf{Seer}, by inflating the pretrained text-to-image (T2I) Stable Diffusion
models along the temporal axis. We enhance the U-Net and language conditioning
model by incorporating computation-efficient spatial-temporal attention.
Furthermore, we introduce a novel Frame Sequential Text Decomposer module that
dissects a sentence's global instruction into temporally aligned
sub-instructions, ensuring precise integration into each frame of generation.
Our framework allows us to effectively leverage the extensive prior knowledge
embedded in pretrained T2I models across the frames. With the
adaptably designed architecture, Seer makes it possible to generate
high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a
few layers on a small amount of data. Experimental results on the Something
Something V2 (SSv2), BridgeData, and EpicKitchens-100 datasets demonstrate our
superior video prediction performance at around 480 GPU-hours, versus over
12,480 GPU-hours for CogVideo: a 31% FVD improvement over the current SOTA
model on SSv2 and an 83.7% average preference in the human evaluation.
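The decomposed spatial-temporal attention described above can be illustrated with a minimal NumPy toy. This is a sketch of the general inflation idea (spatial attention within each frame, then temporal attention across frames at each spatial location), not Seer's actual implementation; the shapes, function names, and single-head formulation are assumptions for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    # Numerically stable softmax.
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention, batched over all leading axes.
    d = q.shape[-1]
    w = softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d))
    return w @ v

def spatial_temporal_attention(x):
    """x: (T, N, D) — T frames, N spatial tokens per frame, D channels.
    Full attention over all T*N tokens would cost O((T*N)^2); decomposing
    into a spatial pass then a temporal pass costs O(T*N^2 + N*T^2)."""
    # Spatial pass: each frame attends within itself (this is the part
    # that can reuse pretrained T2I attention weights).
    x = attend(x, x, x)
    # Temporal pass: each spatial location attends across frames.
    xt = np.swapaxes(x, 0, 1)        # (N, T, D)
    xt = attend(xt, xt, xt)
    return np.swapaxes(xt, 0, 1)     # back to (T, N, D)

x = np.random.randn(8, 16, 32)       # 8 frames, 16 patches, 32 channels
y = spatial_temporal_attention(x)
print(y.shape)                       # (8, 16, 32)
```

In a real inflated U-Net the spatial attention layers keep their pretrained T2I weights and only the newly added temporal layers (plus a few others) are fine-tuned, which is what makes the approach sample- and computation-efficient.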
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced from a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicatedly designed components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z) - LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text.
arXiv Detail & Related papers (2022-10-21T13:03:49Z) - Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z) - Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z) - Understanding Road Layout from Videos as a Whole [82.30800791500869]
We formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently.
We exploit three novel aspects: leveraging camera motions in videos, including context cues, and incorporating long-term video information.
arXiv Detail & Related papers (2020-07-02T00:59:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.