Conditional Temporal Variational AutoEncoder for Action Video Prediction
- URL: http://arxiv.org/abs/2108.05658v1
- Date: Thu, 12 Aug 2021 10:59:23 GMT
- Title: Conditional Temporal Variational AutoEncoder for Action Video Prediction
- Authors: Xiaogang Xu, Yi Wang, Liwei Wang, Bei Yu, Jiaya Jia
- Abstract summary: ACT-VAE predicts pose sequences for an action clip from a single input image.
When connected with a plug-and-play Pose-to-Image (P2I) network, ACT-VAE can synthesize image sequences.
- Score: 66.63038712306606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To synthesize a realistic action sequence based on a single human image, it
is crucial to model both motion patterns and diversity in the action video.
This paper proposes an Action Conditional Temporal Variational AutoEncoder
(ACT-VAE) to improve motion prediction accuracy and capture movement diversity.
ACT-VAE predicts pose sequences for an action clip from a single input image.
It is implemented as a deep generative model that maintains temporal coherence
according to the action category, with novel temporal modeling of the latent
space. Further, ACT-VAE is a general action sequence prediction framework. When
connected with a plug-and-play Pose-to-Image (P2I) network, ACT-VAE can
synthesize image sequences. Extensive experiments show that our approach can
predict accurate poses and synthesize realistic image sequences, surpassing
state-of-the-art approaches. Compared to existing methods, ACT-VAE improves
model accuracy and preserves diversity.
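For illustration, below is a minimal PyTorch-style sketch of an action-conditional temporal VAE that rolls out a pose sequence from a single starting pose and an action label, with a learned temporal prior on the latent space. All module names, dimensions, and loss weights are assumptions made for this sketch; it is not the authors' implementation, and the plug-and-play Pose-to-Image (P2I) rendering stage is omitted.

```python
# Sketch only (assumed architecture, not the paper's code): an action-conditional
# temporal VAE that predicts a pose sequence conditioned on an action label.
import torch
import torch.nn as nn


class ActionCondTemporalVAE(nn.Module):
    def __init__(self, pose_dim=34, num_actions=10, z_dim=32, hid=128):
        super().__init__()
        self.act_emb = nn.Embedding(num_actions, hid)
        # Recurrent state carries temporal context across time steps.
        self.rnn = nn.GRUCell(pose_dim + z_dim + hid, hid)
        # Posterior q(z_t | pose_t, h_{t-1}, action), used during training.
        self.enc = nn.Linear(pose_dim + hid + hid, 2 * z_dim)
        # Temporal prior p(z_t | h_{t-1}, action), used when sampling.
        self.prior = nn.Linear(hid + hid, 2 * z_dim)
        # Decoder maps the latent and recurrent state to the next pose.
        self.dec = nn.Linear(z_dim + hid, pose_dim)
        self.z_dim, self.hid = z_dim, hid

    @staticmethod
    def _split(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu, logvar

    def forward(self, pose_seq, action):
        """pose_seq: (B, T, pose_dim); action: (B,) long. Returns (recon, loss)."""
        B, T, _ = pose_seq.shape
        a = self.act_emb(action)
        h = pose_seq.new_zeros(B, self.hid)
        recons, kl = [], 0.0
        for t in range(1, T):
            q_mu, q_logvar = self._split(self.enc(torch.cat([pose_seq[:, t], h, a], -1)))
            p_mu, p_logvar = self._split(self.prior(torch.cat([h, a], -1)))
            z = q_mu + torch.randn_like(q_mu) * (0.5 * q_logvar).exp()
            recons.append(self.dec(torch.cat([z, h], -1)))
            # KL between the approximate posterior and the learned temporal prior.
            kl = kl + 0.5 * (p_logvar - q_logvar
                             + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp()
                             - 1).sum(-1).mean()
            h = self.rnn(torch.cat([pose_seq[:, t], z, a], -1), h)
        recon = torch.stack(recons, dim=1)
        loss = ((recon - pose_seq[:, 1:]) ** 2).mean() + 1e-3 * kl
        return recon, loss

    @torch.no_grad()
    def sample(self, pose0, action, steps=16):
        """Roll out a pose sequence from a single input pose using the prior."""
        a = self.act_emb(action)
        h = pose0.new_zeros(pose0.size(0), self.hid)
        pose, out = pose0, [pose0]
        for _ in range(steps):
            p_mu, p_logvar = self._split(self.prior(torch.cat([h, a], -1)))
            z = p_mu + torch.randn_like(p_mu) * (0.5 * p_logvar).exp()
            pose = self.dec(torch.cat([z, h], -1))
            h = self.rnn(torch.cat([pose, z, a], -1), h)
            out.append(pose)
        return torch.stack(out, dim=1)
```

In this reading, diversity comes from sampling different latent trajectories from the action-conditioned temporal prior, while the recurrent state keeps the rollout temporally coherent; a separate pose-to-image network would then render each predicted pose into a video frame.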
Related papers
- Generalizable Implicit Motion Modeling for Video Frame Interpolation [51.966062283735596]
Motion is critical in flow-based Video Frame Interpolation (VFI).
We introduce Generalizable Implicit Motion Modeling (GIMM), a novel and effective approach to motion modeling for VFI.
Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion.
arXiv Detail & Related papers (2024-07-11T17:13:15Z) - Coherent Temporal Synthesis for Incremental Action Segmentation [42.46228728930902]
This paper presents the first exploration of video data replay techniques for incremental action segmentation.
We propose a Temporally Coherent Action model, which represents actions using a generative model instead of storing individual frames.
In a 10-task incremental setup on the Breakfast dataset, our approach achieves accuracy gains of up to 22% compared to the baselines.
arXiv Detail & Related papers (2024-03-10T06:07:06Z) - Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning.
arXiv Detail & Related papers (2023-06-01T07:48:34Z) - STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z) - STDepthFormer: Predicting Spatio-temporal Depth from Video with a
Self-supervised Transformer Model [0.0]
A self-supervised model that simultaneously predicts a sequence of future frames from video input with a spatio-temporal attention network is proposed.
The proposed model leverages prior scene knowledge such as object shape and texture similar to single-image depth inference methods.
It is implicitly capable of forecasting the motion of objects in the scene, rather than requiring complex models involving multi-object detection, segmentation and tracking.
arXiv Detail & Related papers (2023-03-02T12:22:51Z) - Weakly-supervised Action Transition Learning for Stochastic Human Motion
Prediction [81.94175022575966]
We introduce the task of action-driven human motion prediction.
It aims to predict multiple plausible future motions given a sequence of action labels and a short motion history.
arXiv Detail & Related papers (2022-05-31T08:38:07Z) - Generating Smooth Pose Sequences for Diverse Human Motion Prediction [90.45823619796674]
We introduce a unified deep generative network for both diverse and controllable motion prediction.
Our experiments on two standard benchmark datasets, Human3.6M and HumanEva-I, demonstrate that our approach outperforms the state-of-the-art baselines in terms of both sample diversity and accuracy.
arXiv Detail & Related papers (2021-08-19T00:58:00Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatial-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [44.523477804533364]
We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences.
In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence.
We learn an action-aware latent representation for human motions by training a generative variational autoencoder.
arXiv Detail & Related papers (2021-04-12T17:40:27Z) - Learning a Generative Motion Model from Image Sequences based on a
Latent Motion Matrix [8.774604259603302]
We learn a probabilistic motion model from temporal registration in a sequence of images.
We show improved registration accuracy and temporally smoother consistency compared to three state-of-the-art registration algorithms.
We also demonstrate the model's applicability for motion analysis, simulation and super-resolution by an improved motion reconstruction from sequences with missing frames.
arXiv Detail & Related papers (2020-11-03T14:44:09Z)