Generative Adversarial Network for Future Hand Segmentation from
Egocentric Video
- URL: http://arxiv.org/abs/2203.11305v1
- Date: Mon, 21 Mar 2022 19:41:44 GMT
- Title: Generative Adversarial Network for Future Hand Segmentation from
Egocentric Video
- Authors: Wenqi Jia, Miao Liu and James M. Rehg
- Abstract summary: We introduce the novel problem of anticipating a time series of future hand masks from egocentric video.
A key challenge is to model the stochasticity of future head motions, which globally impact the head-worn camera video analysis.
- Score: 25.308139917320673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the novel problem of anticipating a time series of future hand
masks from egocentric video. A key challenge is to model the stochasticity of
future head motions, which globally impact the head-worn camera video analysis.
To this end, we propose a novel deep generative model -- EgoGAN, which uses a
3D Fully Convolutional Network to learn a spatio-temporal video representation
for pixel-wise visual anticipation, generates future head motion using
Generative Adversarial Network (GAN), and then predicts the future hand masks
based on the video representation and the generated future head motion. We
evaluate our method on both the EPIC-Kitchens and the EGTEA Gaze+ datasets. We
conduct detailed ablation studies to validate the design choices of our
approach. Furthermore, we compare our method with previous state-of-the-art
methods on future image segmentation and show that our method can more
accurately predict future hand masks.
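The abstract's pipeline (a 3D fully convolutional video encoder, a GAN generator of future head motion, and a hand-mask decoder conditioned on both) can be outlined in code. The sketch below is a minimal, illustrative PyTorch rendering of that structure only; the module names, layer sizes, tensor shapes, and the per-frame 2D parameterization of head motion are assumptions made for illustration, not the paper's exact architecture.
```python
# Minimal sketch of the EgoGAN-style pipeline described in the abstract.
# All sizes and the head-motion parameterization are illustrative assumptions.
import torch
import torch.nn as nn


class Video3DEncoder(nn.Module):
    """3D fully convolutional encoder for a spatio-temporal video representation."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, feat_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_ch, feat_ch, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, video):          # video: (B, 3, T, H, W)
        return self.net(video)         # (B, C, T, H/4, W/4)


class HeadMotionGenerator(nn.Module):
    """GAN generator: samples future head motion (here, 2D shifts per future
    frame -- an assumption) from the pooled video feature plus a noise vector."""
    def __init__(self, feat_ch=64, noise_dim=16, t_future=4):
        super().__init__()
        self.t_future = t_future
        self.fc = nn.Sequential(
            nn.Linear(feat_ch + noise_dim, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, t_future * 2),
        )

    def forward(self, feat, noise):
        pooled = feat.mean(dim=(2, 3, 4))                  # (B, C) global pooling
        motion = self.fc(torch.cat([pooled, noise], 1))    # (B, T_future * 2)
        return motion.view(-1, self.t_future, 2)


class HeadMotionDiscriminator(nn.Module):
    """GAN discriminator scoring real vs. generated future head motion."""
    def __init__(self, t_future=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(t_future * 2, 64), nn.ReLU(inplace=True), nn.Linear(64, 1)
        )

    def forward(self, motion):
        return self.net(motion.flatten(1))


class HandMaskDecoder(nn.Module):
    """Predicts future hand masks from the video feature conditioned on the
    generated head motion, broadcast as extra channels."""
    def __init__(self, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(feat_ch + 2, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_ch, 1, kernel_size=1),
        )
        self.up = nn.Upsample(scale_factor=(1, 4, 4), mode="trilinear",
                              align_corners=False)

    def forward(self, feat, motion):
        B, C, T, H, W = feat.shape
        # Summarize the generated motion into one vector per clip and broadcast
        # it over space-time (a simplification for this sketch).
        m = motion.mean(dim=1, keepdim=True)                 # (B, 1, 2)
        m = m.permute(0, 2, 1).unsqueeze(-1).unsqueeze(-1)   # (B, 2, 1, 1, 1)
        m = m.expand(B, 2, T, H, W)
        logits = self.net(torch.cat([feat, m], dim=1))       # (B, 1, T, H, W)
        return torch.sigmoid(self.up(logits))                # back to input resolution


if __name__ == "__main__":
    video = torch.randn(2, 3, 8, 64, 64)                     # batch of short egocentric clips
    enc, gen = Video3DEncoder(), HeadMotionGenerator()
    disc, dec = HeadMotionDiscriminator(), HandMaskDecoder()
    feat = enc(video)
    motion = gen(feat, torch.randn(2, 16))
    masks = dec(feat, motion)
    print(masks.shape)                                       # (2, 1, 8, 64, 64)
```
In the paper, the head-motion generator is trained adversarially against a discriminator on real future head motion while the mask decoder is supervised with ground-truth hand masks; the sketch above only wires up a plausible forward pass.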
Related papers
- EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos [49.24266108952835]
Given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video.
EgoExo-Gen explicitly models the hand-object dynamics for cross-view video prediction.
arXiv Detail & Related papers (2025-04-16T03:12:39Z)
- E-Motion: Future Motion Simulation via Event Sequence Diffusion [86.80533612211502]
Event-based sensors may potentially offer a unique opportunity to predict future motion with a level of detail and precision previously unachievable.
We propose to integrate the strong learning capacity of the video diffusion model with the rich motion information of an event camera as a motion simulation framework.
Our findings suggest a promising direction for future research in enhancing the interpretative power and predictive accuracy of computer vision systems.
arXiv Detail & Related papers (2024-10-11T09:19:23Z)
- Forecasting Future Videos from Novel Views via Disentangled 3D Scene Representation [54.60804602905519]
Prior works learn an entangled representation, aiming to model layered scene geometry, motion forecasting and novel view synthesis together.
Our approach instead disentangles scene geometry from scene motion by lifting the 2D scene to 3D point clouds.
To model future 3D scene motion, we propose a disentangled two-stage approach that initially forecasts ego-motion and subsequently the residual motion of dynamic objects.
arXiv Detail & Related papers (2024-07-31T08:54:50Z)
- Video Prediction Models as General Visual Encoders [0.0]
The researchers propose using video prediction models as general visual encoders, leveraging their ability to capture critical spatial and temporal information.
Inspired by human vision studies, the approach aims to develop a latent space representative of motion from images.
Experiments involve adapting pre-trained video generative models, analyzing their latent spaces, and training custom decoders for foreground-background segmentation.
arXiv Detail & Related papers (2024-05-25T23:55:47Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Future Frame Prediction for Robot-assisted Surgery [57.18185972461453]
We propose a ternary prior guided variational autoencoder (TPG-VAE) model for future frame prediction in robotic surgical video sequences.
Besides the content distribution, our model learns the motion distribution, a novelty that handles the small movements of surgical tools.
arXiv Detail & Related papers (2021-03-18T15:12:06Z)
- Mutual Information Based Method for Unsupervised Disentanglement of Video Representation [0.0]
Video prediction models have found prospective applications in Maneuver Planning, Health care, Autonomous Navigation and Simulation.
One of the major challenges in future frame generation is due to the high dimensional nature of visual data.
We propose a Mutual Information Predictive Auto-Encoder framework that reduces the task of predicting high-dimensional video frames.
arXiv Detail & Related papers (2020-11-17T13:16:07Z)
- Unsupervised Video Representation Learning by Bidirectional Feature Prediction [16.074111448606512]
This paper introduces a novel method for self-supervised video representation learning via feature prediction.
We argue that a supervisory signal arising from unobserved past frames is complementary to one that originates from the future frames.
We empirically show that utilizing both signals enriches the learned representations for the downstream task of action recognition.
arXiv Detail & Related papers (2020-11-11T19:42:31Z)
- Head2Head++: Deep Facial Attributes Re-Targeting [6.230979482947681]
We leverage the 3D geometry of faces and Generative Adversarial Networks (GANs) to design a novel deep learning architecture for the task of facial and head reenactment.
We manage to capture the complex non-rigid facial motion from the driving monocular performances and synthesise temporally consistent videos.
Our system performs end-to-end reenactment at nearly real-time speed (18 fps).
arXiv Detail & Related papers (2020-06-17T23:38:37Z)
- Head2Head: Video-based Neural Head Synthesis [50.32988828989691]
We propose a novel machine learning architecture for facial reenactment.
We show that the proposed method can transfer facial expressions, pose and gaze of a source actor to a target video in a photo-realistic fashion more accurately than state-of-the-art methods.
arXiv Detail & Related papers (2020-05-22T00:44:43Z)
- Future Video Synthesis with Object Motion Prediction [54.31508711871764]
Instead of synthesizing images directly, our approach is designed to understand the complex scene dynamics.
The appearance of the scene components in the future is predicted by non-rigid deformation of the background and affine transformation of moving objects.
Experimental results on the Cityscapes and KITTI datasets show that our model outperforms the state-of-the-art in terms of visual quality and accuracy.
arXiv Detail & Related papers (2020-04-01T16:09:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.