Playable Video Generation
- URL: http://arxiv.org/abs/2101.12195v1
- Date: Thu, 28 Jan 2021 18:55:58 GMT
- Title: Playable Video Generation
- Authors: Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr
Siarohin, Elisa Ricci
- Abstract summary: We aim at allowing a user to control the generated video by selecting a discrete action at every time step as when playing a video game.
The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input.
We propose a novel framework for PVG that is trained in a self-supervised manner on a large dataset of unlabelled videos.
- Score: 47.531594626822155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the unsupervised learning problem of playable video
generation (PVG). In PVG, we aim at allowing a user to control the generated
video by selecting a discrete action at every time step as when playing a video
game. The difficulty of the task lies both in learning semantically consistent
actions and in generating realistic videos conditioned on the user input. We
propose a novel framework for PVG that is trained in a self-supervised manner
on a large dataset of unlabelled videos. We employ an encoder-decoder
architecture where the predicted action labels act as a bottleneck. The network
is constrained to learn a rich action space using, as main driving loss, a
reconstruction loss on the generated video. We demonstrate the effectiveness of
the proposed approach on several datasets with wide environment variety.
Further details, code and examples are available on our project page
willi-menapace.github.io/playable-video-generation-website.
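The abstract describes an encoder-decoder architecture in which a predicted discrete action label is the bottleneck between consecutive frames, and the main driving loss is reconstruction of the generated video. The sketch below illustrates that idea only; the module names, layer sizes, and the Gumbel-softmax used to keep the discrete bottleneck differentiable are illustrative assumptions rather than the authors' implementation (the official code is linked from the project page).

```python
# Minimal sketch of a discrete action bottleneck trained with a reconstruction
# loss, assuming 64x64 RGB frames. Illustrative only; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionBottleneckModel(nn.Module):
    def __init__(self, num_actions=7, hidden=128):
        super().__init__()
        # Action network: predicts which discrete action explains the
        # transition from frame t to frame t+1 (input: both frames stacked).
        self.action_encoder = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_actions),
        )
        # Frame encoder / decoder: a stand-in for the full video generator.
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, hidden, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.action_embed = nn.Embedding(num_actions, hidden)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, frame_t, frame_t1):
        # Gumbel-softmax keeps the discrete action choice differentiable
        # during training (an assumed mechanism, chosen for illustration).
        logits = self.action_encoder(torch.cat([frame_t, frame_t1], dim=1))
        action = F.gumbel_softmax(logits, tau=1.0, hard=True)        # (B, A) one-hot
        feat = self.frame_encoder(frame_t)                           # (B, H, 32, 32)
        act = (action @ self.action_embed.weight)[:, :, None, None]  # (B, H, 1, 1)
        recon = self.decoder(feat + act)                             # predicted frame t+1
        return recon, logits

# Reconstruction of the next frame is the main driving loss, so the discrete
# action code is forced to carry the motion information.
model = ActionBottleneckModel()
frame_t, frame_t1 = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
recon, _ = model(frame_t, frame_t1)
loss = F.mse_loss(recon, frame_t1)
loss.backward()
```

In this toy setup the only information the decoder receives about frame t+1 is the one-hot action code, so minimizing the reconstruction loss pushes the discrete actions to capture the dominant motion modes; at inference a user could supply the action index directly to "play" the video.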
Related papers
- PlaySlot: Learning Inverse Latent Dynamics for Controllable Object-Centric Video Prediction and Planning [19.67005754615478]
PlaySlot is an object-centric video prediction model that infers object representations and latent actions from unlabeled video sequences.
PlaySlot can generate multiple possible futures conditioned on latent actions, which are inferred from the video dynamics.
Our results show that PlaySlot outperforms object-centric baselines for video prediction across different environments.
arXiv Detail & Related papers (2025-02-11T14:50:10Z) - Generative Video Propagation [87.15843701018099]
Our framework, GenProp, encodes the original video with a selective content encoder and propagates the changes made to the first frame using an image-to-video generation model.
Experimental results demonstrate the leading performance of our model on various video tasks.
arXiv Detail & Related papers (2024-12-27T17:42:29Z) - Video Creation by Demonstration [59.389591010842636]
We present δ-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction.
By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process.
Empirically, δ-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - VGMShield: Mitigating Misuse of Video Generative Models [7.963591895964269]
We introduce VGMShield: a set of three straightforward but pioneering mitigations through the lifecycle of fake video generation.
We first try to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos.
Then, we investigate the tracing problem, which maps a fake video back to a model that generates it.
arXiv Detail & Related papers (2024-02-20T16:39:23Z) - Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - Weakly Supervised Two-Stage Training Scheme for Deep Video Fight
Detection Model [0.0]
Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z) - Enabling Weakly-Supervised Temporal Action Localization from On-Device
Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to learn directly from an on-device, long video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z) - Autoencoding Video Latents for Adversarial Video Generation [0.0]
AVLAE is a two-stream latent autoencoder in which the video distribution is learned by adversarial training.
We demonstrate that our approach learns to disentangle motion and appearance codes even without the explicit structural composition in the generator.
arXiv Detail & Related papers (2022-01-18T11:42:14Z) - Unsupervised Domain Adaptation for Video Semantic Segmentation [91.30558794056054]
Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to the real world.
In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation.
We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.
arXiv Detail & Related papers (2021-07-23T07:18:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.