Depth-Aware Action Recognition: Pose-Motion Encoding through Temporal Heatmaps
- URL: http://arxiv.org/abs/2011.13399v1
- Date: Thu, 26 Nov 2020 17:26:42 GMT
- Title: Depth-Aware Action Recognition: Pose-Motion Encoding through Temporal Heatmaps
- Authors: Mattia Segu, Federico Pirovano, Gianmario Fumagalli, Amedeo Fabris
- Abstract summary: We propose a depth-aware descriptor that encodes pose and motion information in a unified representation for action classification in-the-wild.
The key component of our method is the Depth-Aware Pose Motion representation (DA-PoTion), a new video descriptor that encodes the 3D movement of semantic keypoints of the human body.
- Score: 2.2079886535603084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most state-of-the-art methods for action recognition rely only on 2D spatial
features encoding appearance, motion or pose. However, 2D data lacks the depth
information, which is crucial for recognizing fine-grained actions. In this
paper, we propose a depth-aware volumetric descriptor that encodes pose and
motion information in a unified representation for action classification
in-the-wild. Our framework is robust to many challenges inherent to action
recognition, e.g. variation in viewpoint, scene, clothing and body shape. The
key component of our method is the Depth-Aware Pose Motion representation
(DA-PoTion), a new video descriptor that encodes the 3D movement of semantic
keypoints of the human body. Given a video, we produce human joint heatmaps for
each frame using a state-of-the-art 3D human pose regressor and we give each of
them a unique color code according to the relative time in the clip. Then, we
aggregate such 3D time-encoded heatmaps for all human joints to obtain a
fixed-size descriptor (DA-PoTion), which is suitable for classifying actions
using a shallow 3D convolutional neural network (CNN). The DA-PoTion alone
defines a new state-of-the-art on the Penn Action Dataset. Moreover, we
leverage the intrinsic complementarity of our pose-motion descriptor with
appearance-based approaches by combining it with Inflated 3D ConvNet (I3D) to
define a new state-of-the-art on the JHMDB Dataset.
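The descriptor construction described above lends itself to a compact implementation. The following is a minimal, illustrative sketch of a DA-PoTion-style aggregation, assuming per-frame volumetric joint heatmaps have already been produced by an external 3D pose regressor; the piecewise-linear time colorization, the number of color channels, and the normalization are assumptions made for illustration, not the authors' released code.

```python
# Hedged sketch of a DA-PoTion-style descriptor: per-frame 3D joint heatmaps are
# weighted by a time-dependent "color" and summed over the clip, yielding a
# fixed-size volumetric descriptor for a shallow 3D CNN classifier.
import numpy as np

def time_color_weights(num_frames: int, num_channels: int = 3) -> np.ndarray:
    """Piecewise-linear weights mapping each frame's relative time to color channels."""
    t = np.linspace(0.0, 1.0, num_frames)           # relative time in [0, 1]
    centers = np.linspace(0.0, 1.0, num_channels)   # one peak per channel
    width = 1.0 / (num_channels - 1)
    w = np.clip(1.0 - np.abs(t[:, None] - centers[None, :]) / width, 0.0, 1.0)
    return w                                        # shape (T, C)

def da_potion(heatmaps: np.ndarray, num_channels: int = 3) -> np.ndarray:
    """Aggregate per-frame volumetric joint heatmaps into a fixed-size descriptor.

    heatmaps: (T, J, D, H, W) array, one 3D heatmap per joint per frame.
    returns:  (J * num_channels, D, H, W) time-encoded descriptor.
    """
    T, J, D, H, W = heatmaps.shape
    w = time_color_weights(T, num_channels)                 # (T, C)
    # Weight every frame's heatmap by its time color and sum over time.
    desc = np.einsum('tjdhw,tc->jcdhw', heatmaps, w)        # (J, C, D, H, W)
    # Normalize each joint/channel volume so the result is clip-length invariant.
    desc /= desc.max(axis=(2, 3, 4), keepdims=True) + 1e-8
    return desc.reshape(J * num_channels, D, H, W)

# Example: 16 frames, 17 joints, 32^3 heatmap volumes.
dummy = np.random.rand(16, 17, 32, 32, 32).astype(np.float32)
descriptor = da_potion(dummy)          # feed this to a shallow 3D CNN
print(descriptor.shape)                # (51, 32, 32, 32)
```

The time-dependent weights play the role of the color coding mentioned in the abstract: early, middle, and late portions of the clip dominate different channels, so a single fixed-size volume preserves when each joint occupied each 3D location.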
Related papers
- 2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation? [5.408549711581793]
We study the effect of using either 2D or 3D joint coordinates as training data on the performance of speech-to-gesture deep generative models.
We employ a lifting model for converting generated 2D pose sequences into 3D and assess how gestures created directly in 3D stack up against those initially generated in 2D and then converted to 3D.
arXiv Detail & Related papers (2024-09-16T15:06:12Z)
- DGD: Dynamic 3D Gaussians Distillation [14.7298711927857]
We tackle the task of learning dynamic 3D semantic radiance fields given a single monocular video as input.
Our learned semantic radiance field captures per-point semantics as well as color and geometric properties for a dynamic 3D scene.
We present DGD, a unified 3D representation for both the appearance and semantics of a dynamic 3D scene.
arXiv Detail & Related papers (2024-05-29T17:52:22Z)
- Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video [23.93644678238666]
We propose a Pose and Mesh Co-Evolution network (PMCE) to recover 3D human motion from a video.
The proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency.
arXiv Detail & Related papers (2023-08-20T16:03:21Z)
- BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects [89.2314092102403]
We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence.
Our method works for arbitrary rigid objects, even when visual texture is largely absent.
arXiv Detail & Related papers (2023-03-24T17:13:49Z)
- Scene-Aware 3D Multi-Human Motion Capture from a Single Camera [83.06768487435818]
We consider the problem of estimating the 3D position of multiple humans in a scene as well as their body shape and articulation from a single RGB video recorded with a static camera.
We leverage recent advances in computer vision using large-scale pre-trained models for a variety of modalities, including 2D body joints, joint angles, normalized disparity maps, and human segmentation masks.
In particular, we estimate the scene depth and unique person scale from normalized disparity predictions using the 2D body joints and joint angles.
arXiv Detail & Related papers (2023-01-12T18:01:28Z)
- IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation [6.270047084514142]
Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos.
IVT enables learning temporal contextual depth information from visual features and 3D poses directly from video frames.
Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performances.
arXiv Detail & Related papers (2022-08-06T02:36:33Z)
- Gait Recognition in the Wild with Dense 3D Representations and A Benchmark [86.68648536257588]
Existing studies for gait recognition are dominated by 2D representations like the silhouette or skeleton of the human body in constrained scenes.
This paper aims to explore dense 3D representations for gait recognition in the wild.
We build the first large-scale 3D representation-based gait recognition dataset, named Gait3D.
arXiv Detail & Related papers (2022-04-06T03:54:06Z)
- Tracking People with 3D Representations [78.97070307547283]
We present a novel approach for tracking multiple people in video.
Unlike past approaches which employ 2D representations, we employ 3D representations of people, located in three-dimensional space.
We find that 3D representations are more effective than 2D representations for tracking in these settings.
arXiv Detail & Related papers (2021-11-15T16:15:21Z)
- Action2video: Generating Videos of Human 3D Actions [31.665831044217363]
We aim to tackle the interesting yet challenging problem of generating videos of diverse and natural human motions from prescribed action categories.
The key issue lies in the ability to synthesize multiple distinct motion sequences that are realistic in their visual appearances.
Action2motion generates plausible 3D pose sequences of a prescribed action category, which are processed and rendered by motion2video to form 2D videos.
arXiv Detail & Related papers (2021-11-12T20:20:37Z)
- D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)
- Learning Compositional Representation for 4D Captures with Neural ODE [72.56606274691033]
We introduce a compositional representation for 4D captures, that disentangles shape, initial state, and motion respectively.
To model the motion, a neural Ordinary Differential Equation (ODE) is trained to update the initial state conditioned on the learned motion code.
A decoder takes the shape code and the updated pose code to reconstruct 4D captures at each time stamp.
arXiv Detail & Related papers (2021-03-15T10:55:55Z)
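For the compositional 4D-capture entry above, the following is a hedged sketch of the general idea: a shape code, an initial state, and a motion code, with the state integrated forward in time and decoded at each time stamp. The network sizes, the plain Euler integrator used in place of a learned ODE solver, and the point-cloud decoder are illustrative assumptions rather than the authors' architecture.

```python
# Hedged sketch: disentangled shape / initial-state / motion codes, a latent
# state integrated over time, and a decoder producing geometry per time stamp.
import torch
import torch.nn as nn

class LatentODEFunc(nn.Module):
    """Predicts the time derivative of the state, conditioned on the motion code."""
    def __init__(self, state_dim: int = 64, motion_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + motion_dim, 128), nn.Tanh(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state, motion_code):
        return self.net(torch.cat([state, motion_code], dim=-1))

class Compositional4D(nn.Module):
    def __init__(self, shape_dim=64, state_dim=64, motion_dim=32, num_points=1024):
        super().__init__()
        self.ode_func = LatentODEFunc(state_dim, motion_dim)
        self.decoder = nn.Sequential(
            nn.Linear(shape_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_points * 3),
        )
        self.num_points = num_points

    def forward(self, shape_code, init_state, motion_code, timestamps, steps_per_unit=20):
        """Integrate the state to each time stamp and decode a point cloud there."""
        outputs, state, t_prev = [], init_state, 0.0
        for t in timestamps:
            n = max(1, int((t - t_prev) * steps_per_unit))
            dt = (t - t_prev) / n
            for _ in range(n):                       # forward Euler integration
                state = state + dt * self.ode_func(state, motion_code)
            pts = self.decoder(torch.cat([shape_code, state], dim=-1))
            outputs.append(pts.view(-1, self.num_points, 3))
            t_prev = t
        return torch.stack(outputs, dim=1)           # (batch, T, num_points, 3)

# Example: one sequence reconstructed at three normalized time stamps.
model = Compositional4D()
shape, state, motion = torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 32)
clouds = model(shape, state, motion, timestamps=[0.2, 0.5, 1.0])
print(clouds.shape)  # torch.Size([1, 3, 1024, 3])
```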
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.