Playable Video Generation
- URL: http://arxiv.org/abs/2101.12195v1
- Date: Thu, 28 Jan 2021 18:55:58 GMT
- Title: Playable Video Generation
- Authors: Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, Elisa Ricci
- Abstract summary: We aim at allowing a user to control the generated video by selecting a discrete action at every time step as when playing a video game.
The difficulty of the task lies both in learning semantically consistent actions and in generating realistic videos conditioned on the user input.
We propose a novel framework for PVG that is trained in a self-supervised manner on a large dataset of unlabelled videos.
- Score: 47.531594626822155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the unsupervised learning problem of playable video
generation (PVG). In PVG, we aim at allowing a user to control the generated
video by selecting a discrete action at every time step as when playing a video
game. The difficulty of the task lies both in learning semantically consistent
actions and in generating realistic videos conditioned on the user input. We
propose a novel framework for PVG that is trained in a self-supervised manner
on a large dataset of unlabelled videos. We employ an encoder-decoder
architecture where the predicted action labels act as a bottleneck. The network
is constrained to learn a rich action space using, as main driving loss, a
reconstruction loss on the generated video. We demonstrate the effectiveness of
the proposed approach on several datasets with wide environment variety.
Further details, code and examples are available on our project page
willi-menapace.github.io/playable-video-generation-website.
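The abstract's core idea, an encoder-decoder in which a small set of discrete, predicted action labels forms the only bottleneck between consecutive frames, trained mainly with a reconstruction loss on the generated video, can be sketched compactly. The PyTorch snippet below is a hypothetical illustration of that bottleneck, not the authors' released code (see the project page for that); the module sizes, the flattened-frame encoder and decoder, and the straight-through Gumbel-softmax discretization are assumptions made for brevity.
```python
# Minimal sketch of a playable-video-style discrete action bottleneck
# (assumed design for illustration, not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionBottleneckModel(nn.Module):
    def __init__(self, frame_dim=128, num_actions=7):
        super().__init__()
        # Frame encoder: maps a (flattened) 64x64 RGB frame to a latent state.
        self.encoder = nn.Sequential(nn.Linear(3 * 64 * 64, frame_dim), nn.ReLU())
        # Action network: infers a discrete action label from a pair of states.
        self.action_head = nn.Linear(2 * frame_dim, num_actions)
        # Embedding of the discrete action fed to the decoder.
        self.action_embed = nn.Embedding(num_actions, frame_dim)
        # Decoder: predicts the next frame from the current state and the action.
        self.decoder = nn.Linear(2 * frame_dim, 3 * 64 * 64)

    def forward(self, frame_t, frame_t1):
        s_t = self.encoder(frame_t.flatten(1))
        s_t1 = self.encoder(frame_t1.flatten(1))
        # Discrete action distribution describing the change between frames.
        logits = self.action_head(torch.cat([s_t, s_t1], dim=1))
        # Straight-through Gumbel-softmax keeps the bottleneck discrete while
        # staying differentiable (an assumption; other discretizations work).
        action_onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        a = action_onehot @ self.action_embed.weight
        # Reconstruct the next frame from the current state and the action only.
        recon = self.decoder(torch.cat([s_t, a], dim=1))
        return recon, logits

# Reconstruction of the next frame is the main driving loss: it forces the
# frame-to-frame change to be squeezed through one of `num_actions` labels.
model = ActionBottleneckModel()
frame_t = torch.rand(8, 3, 64, 64)
frame_t1 = torch.rand(8, 3, 64, 64)
recon, _ = model(frame_t, frame_t1)
loss = F.mse_loss(recon, frame_t1.flatten(1))
loss.backward()
```
At playback time the inferred action would be replaced by the user's chosen action index at every time step, which is what makes the generated video "playable".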
Related papers
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - VGMShield: Mitigating Misuse of Video Generative Models [7.963591895964269]
We introduce VGMShield: a set of three straightforward but pioneering mitigations through the lifecycle of fake video generation.
We first try to understand whether there is uniqueness in generated videos and whether we can differentiate them from real videos.
Then, we investigate the tracing problem, which maps a fake video back to a model that generates it.
arXiv Detail & Related papers (2024-02-20T16:39:23Z) - Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - Weakly Supervised Two-Stage Training Scheme for Deep Video Fight
Detection Model [0.0]
Fight detection in videos is an emerging deep learning application with today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z) - Enabling Weakly-Supervised Temporal Action Localization from On-Device
Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to directly learn from an on-device, long video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z) - Autoencoding Video Latents for Adversarial Video Generation [0.0]
AVLAE is a two-stream latent autoencoder where the video distribution is learned by adversarial training.
We demonstrate that our approach learns to disentangle motion and appearance codes even without the explicit structural composition in the generator.
arXiv Detail & Related papers (2022-01-18T11:42:14Z) - Unsupervised Domain Adaptation for Video Semantic Segmentation [91.30558794056054]
Unsupervised Domain Adaptation for semantic segmentation has gained immense popularity since it can transfer knowledge from simulation to the real world.
In this work, we present a new video extension of this task, namely Unsupervised Domain Adaptation for Video Semantic Segmentation.
We show that our proposals significantly outperform previous image-based UDA methods both on image-level (mIoU) and video-level (VPQ) evaluation metrics.
arXiv Detail & Related papers (2021-07-23T07:18:20Z) - Unsupervised Multimodal Video-to-Video Translation via Self-Supervised
Learning [92.17835753226333]
We propose a novel unsupervised video-to-video translation model.
Our model decomposes the style and the content using the specialized UV-decoder structure.
Our model can produce photo-realistic videos in a multimodal way.
arXiv Detail & Related papers (2020-04-14T13:44:30Z) - Non-Adversarial Video Synthesis with Learned Priors [53.26777815740381]
We focus on the problem of generating videos from latent noise vectors, without any reference input frames.
We develop a novel approach that jointly optimizes the input latent space, the weights of a recurrent neural network and a generator through non-adversarial learning.
Our approach generates superior quality videos compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2020-03-21T02:57:33Z)