Video Creation by Demonstration
- URL: http://arxiv.org/abs/2412.09551v1
- Date: Thu, 12 Dec 2024 18:41:20 GMT
- Title: Video Creation by Demonstration
- Authors: Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu
- Abstract summary: We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
- Score: 59.389591010842636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential for interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.
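The abstract describes a conditioning pipeline: an action latent is extracted from the demonstration video through an appearance bottleneck, then used together with a context image to condition future-frame generation. The toy PyTorch sketch below mirrors only that data flow; all module names, shapes, and the bottleneck choice are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal, illustrative sketch of the delta-Diffusion conditioning flow described
# in the abstract. Module names, shapes, and the bottleneck design are assumptions.
import torch
import torch.nn as nn

class AppearanceBottleneckEncoder(nn.Module):
    """Encodes a demonstration clip into a low-dimensional 'action' latent.

    The aggressive pooling plus a small projection stands in for the appearance
    bottleneck: little capacity is left for per-pixel appearance, so mostly
    motion/action information should survive (an assumption made here)."""
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.backbone = nn.Conv3d(3, 64, kernel_size=3, padding=1)  # stand-in for a video foundation model
        self.pool = nn.AdaptiveAvgPool3d((4, 1, 1))                 # discard spatial detail (bottleneck)
        self.proj = nn.Linear(64 * 4, latent_dim)

    def forward(self, demo_video: torch.Tensor) -> torch.Tensor:
        # demo_video: (B, 3, T, H, W)
        feats = self.pool(self.backbone(demo_video)).flatten(1)
        return self.proj(feats)                                     # (B, latent_dim) action latent

class ConditionalFramePredictor(nn.Module):
    """Toy stand-in for the conditional video diffusion model: predicts future
    frames from a context image and the action latent in a single forward pass
    (no iterative denoising; only the conditioning interface is illustrated)."""
    def __init__(self, latent_dim: int = 32, num_future: int = 8):
        super().__init__()
        self.num_future = num_future
        self.film = nn.Linear(latent_dim, 3)                        # latent modulates the context image
        self.refine = nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, context_image: torch.Tensor, action_latent: torch.Tensor) -> torch.Tensor:
        # context_image: (B, 3, H, W); broadcast it over future time steps.
        video = context_image.unsqueeze(2).repeat(1, 1, self.num_future, 1, 1)
        shift = self.film(action_latent)[:, :, None, None, None]
        return self.refine(video + shift)                           # (B, 3, T, H, W)

if __name__ == "__main__":
    demo = torch.randn(1, 3, 16, 64, 64)    # demonstration video from another scene
    context = torch.randn(1, 3, 64, 64)     # context image to continue from
    encoder, generator = AppearanceBottleneckEncoder(), ConditionalFramePredictor()
    action = encoder(demo)                  # implicit latent control signal
    future = generator(context, action)     # generated continuation
    print(action.shape, future.shape)       # torch.Size([1, 32]) torch.Size([1, 3, 8, 64, 64])
```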
Related papers
- Mobius: Text to Seamless Looping Video Generation via Latent Shift [50.04534295458244]
We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations.
Our method repurposes a pre-trained video latent diffusion model to generate looping videos from text prompts without any training.
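The summary does not spell out the mechanism, but the "latent shift" in the title suggests cyclically shifting latent frames during denoising so the clip has no privileged start or end. The sketch below illustrates only that assumed scheduling with a dummy denoiser; it is not the Mobius method.

```python
# Illustrative sketch of a cyclic "latent shift" schedule for looping video
# generation. The denoiser is a dummy; this is a reading of the idea from the
# title and summary, not the Mobius implementation.
import torch

def dummy_denoise(latents: torch.Tensor, step: int) -> torch.Tensor:
    # Stand-in for one denoising step of a pre-trained video latent diffusion model.
    return latents * 0.9

def generate_looping_latents(num_frames: int = 16, channels: int = 4,
                             size: int = 8, steps: int = 20) -> torch.Tensor:
    latents = torch.randn(num_frames, channels, size, size)
    for step in range(steps):
        latents = dummy_denoise(latents, step)
        # Cyclically shift frames by one position each step: over the full
        # schedule every frame occupies many temporal slots, so no frame is
        # treated as a hard "start" or "end" of the clip.
        latents = torch.roll(latents, shifts=1, dims=0)
    return latents

if __name__ == "__main__":
    loop_latents = generate_looping_latents()
    print(loop_latents.shape)  # torch.Size([16, 4, 8, 8])
```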
arXiv Detail & Related papers (2025-02-27T17:33:51Z)
- PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning [19.67005754615478]
PlaySlot is an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences.
PlaySlot can generate multiple possible futures conditioned on latent actions, which are inferred from video dynamics.
Our results show that PlaySlot outperforms object-centric baselines for video prediction across different environments.
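As a reading of the summary above, the sketch below shows how latent actions could be inferred from consecutive object-slot states (inverse dynamics) and then reused to roll out one possible future (forward model). All modules, slot counts, and dimensions are toy assumptions, not PlaySlot's architecture.

```python
# Toy sketch of inverse latent dynamics over object slots. All names and sizes
# are illustrative assumptions.
import torch
import torch.nn as nn

NUM_SLOTS, SLOT_DIM, ACTION_DIM = 4, 16, 8

class ToySlotEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.to_slots = nn.Linear(3 * 32 * 32, NUM_SLOTS * SLOT_DIM)
    def forward(self, frame):                      # frame: (B, 3, 32, 32)
        return self.to_slots(frame.flatten(1)).view(-1, NUM_SLOTS, SLOT_DIM)

class InverseDynamics(nn.Module):
    """Infers a latent action from two consecutive slot states."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * NUM_SLOTS * SLOT_DIM, ACTION_DIM)
    def forward(self, slots_t, slots_t1):
        return self.net(torch.cat([slots_t.flatten(1), slots_t1.flatten(1)], dim=1))

class ForwardModel(nn.Module):
    """Predicts the next slots from current slots and a latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(NUM_SLOTS * SLOT_DIM + ACTION_DIM, NUM_SLOTS * SLOT_DIM)
    def forward(self, slots_t, action):
        out = self.net(torch.cat([slots_t.flatten(1), action], dim=1))
        return out.view(-1, NUM_SLOTS, SLOT_DIM)

if __name__ == "__main__":
    enc, inv, fwd = ToySlotEncoder(), InverseDynamics(), ForwardModel()
    f0, f1 = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)
    s0, s1 = enc(f0), enc(f1)
    a01 = inv(s0, s1)          # latent action inferred from unlabeled video
    s2_pred = fwd(s1, a01)     # one possible future conditioned on that action
    print(a01.shape, s2_pred.shape)
```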
arXiv Detail & Related papers (2025-02-11T14:50:10Z)
- Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation [54.21476271127356]
Divot is a Diffusion-Powered Video Tokenizer. We present Divot-Vicuna through video-to-text autoregression and text-to-video generation.
arXiv Detail & Related papers (2024-12-05T18:53:04Z)
- Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators [46.40277880351059]
We explore utilizing visual signals as a new interface for models to interact with the environment.
We find that the model exhibits an emergent zero-shot capability to infer the semantics of a demonstration video and imitate them in an unseen scenario.
Results show that our models can generate high-quality video clips that accurately align with the semantic guidance provided by the demonstration videos.
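One hedged way to picture this "visual signals as an interface" setup: tokenize the demonstration and the query context, concatenate them into a single prompt, and let an autoregressive transformer continue the sequence. The sketch below shows only that prompt construction and decoding loop with an untrained toy model; the tokenizer, vocabulary, and sizes are assumptions.

```python
# Toy sketch of video in-context imitation via autoregressive token continuation.
import torch
import torch.nn as nn

VOCAB, TOKENS_PER_FRAME = 512, 16

class ToyCausalLM(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        T = tokens.shape[1]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                               # (B, T, VOCAB)

@torch.no_grad()
def imitate(model, demo_tokens, context_tokens, new_frames: int = 2):
    # Prompt = demonstration tokens followed by the query-context tokens.
    seq = torch.cat([demo_tokens, context_tokens], dim=1)
    for _ in range(new_frames * TOKENS_PER_FRAME):
        next_tok = model(seq)[:, -1].argmax(dim=-1, keepdim=True)  # greedy decode
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, -new_frames * TOKENS_PER_FRAME:]         # tokens of the imitated continuation

if __name__ == "__main__":
    model = ToyCausalLM()
    demo = torch.randint(0, VOCAB, (1, 4 * TOKENS_PER_FRAME))     # 4-frame demonstration
    context = torch.randint(0, VOCAB, (1, 1 * TOKENS_PER_FRAME))  # unseen scene, 1 frame
    print(imitate(model, demo, context).shape)                    # torch.Size([1, 32])
```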
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- VideoPhy: Evaluating Physical Commonsense for Video Generation [93.28748850301949]
We present VideoPhy, a benchmark designed to assess whether the generated videos follow physical commonsense for real-world activities.
We then generate videos conditioned on captions from diverse state-of-the-art text-to-video generative models.
Our human evaluation reveals that the existing models severely lack the ability to generate videos adhering to the given text prompts.
arXiv Detail & Related papers (2024-06-05T17:53:55Z)
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free general-purpose video synthesis framework, coined as BIVDiff, via bridging specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet and Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models.
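The three stages named in that sentence can be sketched as a plain data flow, with every model replaced by a placeholder; the function bodies and the way Mixed Inversion is approximated below are assumptions, not BIVDiff's implementation.

```python
# Structural sketch of the three-stage bridging pipeline: (1) frame-wise image
# diffusion, (2) inversion of the frames into latents, (3) joint video denoising.
# Every function body is a placeholder; only the data flow mirrors the summary.
import torch

def image_diffusion_per_frame(conditions: torch.Tensor) -> torch.Tensor:
    # Placeholder for a frame-wise image diffusion model (e.g., a ControlNet-style
    # model applied to each condition independently).
    return torch.tanh(conditions)                        # (T, 3, H, W) "generated" frames

def mixed_inversion(frames: torch.Tensor, mix: float = 0.5) -> torch.Tensor:
    # Placeholder inversion: blend a deterministic "inverted" latent with fresh
    # noise; the mixing ratio is the knob Mixed Inversion is assumed to expose.
    inverted = frames.mean(dim=1, keepdim=True).repeat(1, 4, 1, 1)  # toy 4-channel latent
    return mix * inverted + (1.0 - mix) * torch.randn_like(inverted)

def video_diffusion_denoise(latents: torch.Tensor) -> torch.Tensor:
    # Placeholder for a text-to-video foundation diffusion model that denoises all
    # frame latents jointly, enforcing temporal consistency.
    return latents - latents.mean(dim=0, keepdim=True) * 0.1

if __name__ == "__main__":
    conditions = torch.randn(8, 3, 64, 64)               # 8 per-frame control signals
    frames = image_diffusion_per_frame(conditions)        # stage 1: frame-wise generation
    latents = mixed_inversion(frames)                     # stage 2: invert to latents
    video_latents = video_diffusion_denoise(latents)      # stage 3: joint video denoising
    print(video_latents.shape)                            # torch.Size([8, 4, 64, 64])
```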
arXiv Detail & Related papers (2023-12-05T14:56:55Z)
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
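A hedged sketch of the random-mask setup described above: keep the known boundary frames, mask out the frames in between, and ask a (here, placeholder) video diffusion model to fill the masked region conditioned on text. Only the masking structure is illustrated; the interpolating "model" is a stand-in.

```python
# Toy sketch of masked transition generation between two known frames.
import torch

def build_transition_input(first: torch.Tensor, last: torch.Tensor, num_middle: int = 6):
    # first/last: (3, H, W). Returns the partially observed clip and its mask.
    T = num_middle + 2
    video = torch.zeros(T, *first.shape)
    mask = torch.zeros(T, 1, *first.shape[1:])            # 1 = frame to be generated
    video[0], video[-1] = first, last
    mask[1:-1] = 1.0
    return video, mask

def toy_masked_denoise(video: torch.Tensor, mask: torch.Tensor, text: str) -> torch.Tensor:
    # Placeholder for the random-mask video diffusion model: here it simply
    # interpolates linearly between the endpoints inside the masked region.
    T = video.shape[0]
    weights = torch.linspace(0, 1, T).view(T, 1, 1, 1)
    interp = (1 - weights) * video[0] + weights * video[-1]
    return mask * interp + (1 - mask) * video

if __name__ == "__main__":
    first, last = torch.rand(3, 32, 32), torch.rand(3, 32, 32)
    clip, mask = build_transition_input(first, last)
    transition = toy_masked_denoise(clip, mask, text="a smooth scene transition")
    print(transition.shape)                                # torch.Size([8, 3, 32, 32])
```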
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
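The "improved mask-prediction algorithm for sampling video tokens" mentioned above suggests an iterative masked-token sampling loop; the sketch below shows a generic version of such a loop with a toy predictor, as an assumption rather than the paper's exact algorithm.

```python
# Generic mask-prediction sampling loop for video tokens: start fully masked,
# keep the most confident predictions each round, re-mask the rest.
import torch

VOCAB, SEQ_LEN, MASK_ID = 256, 64, 256   # MASK_ID sits outside the vocabulary

def toy_token_predictor(tokens: torch.Tensor) -> torch.Tensor:
    # Placeholder for the learned bidirectional token predictor: random logits.
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)

@torch.no_grad()
def mask_predict_sample(steps: int = 4) -> torch.Tensor:
    tokens = torch.full((1, SEQ_LEN), MASK_ID)                  # start fully masked
    for step in range(steps):
        logits = toy_token_predictor(tokens)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)         # confidence + argmax token
        still_masked = tokens == MASK_ID
        # Keep a growing fraction of the most confident predictions each step.
        keep_frac = (step + 1) / steps
        threshold = torch.quantile(conf[still_masked], 1.0 - keep_frac) if keep_frac < 1.0 else -1.0
        accept = still_masked & (conf >= threshold)
        tokens = torch.where(accept, pred, tokens)
    return tokens                                               # sampled video token sequence

if __name__ == "__main__":
    print((mask_predict_sample() == MASK_ID).any().item())      # False: every position was filled
```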
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.