A Unified Framework for Human-centric Point Cloud Video Understanding
- URL: http://arxiv.org/abs/2403.20031v1
- Date: Fri, 29 Mar 2024 07:53:06 GMT
- Title: A Unified Framework for Human-centric Point Cloud Video Understanding
- Authors: Yiteng Xu, Kecheng Ye, Xiao Han, Yiming Ren, Xinge Zhu, Yuexin Ma
- Abstract summary: Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds.
We propose a unified framework to make full use of the prior knowledge and explore the inherent features in the data itself for generalized human-centric point cloud video understanding.
Our method achieves state-of-the-art performance on various human-related tasks, including action recognition and 3D pose estimation.
- Score: 23.91555808792291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds, further advancing downstream human-centric tasks and applications. Previous works usually focus on tackling one specific task and rely on large amounts of labeled data, which leads to poor generalization capability. Considering that humans have specific characteristics, including the structural semantics of the human body and the dynamics of human motion, we propose a unified framework that makes full use of this prior knowledge and explores the inherent features in the data itself for generalized human-centric point cloud video understanding. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various human-related tasks, including action recognition and 3D pose estimation. All datasets and code will be released soon.
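The abstract does not describe the architecture, so as a rough illustration of what a "unified framework" serving both tasks can look like, here is a minimal sketch: a shared point cloud video encoder whose features feed separate heads for clip-level action recognition and per-frame 3D pose estimation. All module names, dimensions, and the PointNet-style pooling below are assumptions for illustration, not the authors' released implementation.

```python
# Hypothetical sketch (not the paper's code): one shared point cloud video
# backbone with two task heads, showing how action recognition and 3D pose
# estimation can share features from the same human point cloud sequence.
import torch
import torch.nn as nn

class SharedPointVideoEncoder(nn.Module):
    """Encodes a point cloud video of shape (B, T, N, 3) into per-frame features."""
    def __init__(self, feat_dim=256):
        super().__init__()
        # Per-point MLP followed by max pooling over points (PointNet-style).
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # Temporal model over per-frame features to capture motion dynamics.
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, points):                    # points: (B, T, N, 3)
        per_point = self.point_mlp(points)        # (B, T, N, feat_dim)
        per_frame = per_point.max(dim=2).values   # (B, T, feat_dim)
        seq_feat, _ = self.temporal(per_frame)    # (B, T, feat_dim)
        return seq_feat

class UnifiedHumanPVU(nn.Module):
    """One backbone, two heads: an action label per clip, 3D joints per frame."""
    def __init__(self, num_actions=20, num_joints=24, feat_dim=256):
        super().__init__()
        self.encoder = SharedPointVideoEncoder(feat_dim)
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)

    def forward(self, points):
        seq_feat = self.encoder(points)                          # (B, T, D)
        action_logits = self.action_head(seq_feat.mean(dim=1))   # (B, num_actions)
        joints = self.pose_head(seq_feat)                        # (B, T, J*3)
        return action_logits, joints.view(*joints.shape[:2], -1, 3)

# Usage: clip = torch.randn(2, 16, 1024, 3); logits, joints = UnifiedHumanPVU()(clip)
```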
Related papers
- HINT: Learning Complete Human Neural Representations from Limited Viewpoints [69.76947323932107]
We propose a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles.
As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR.
arXiv Detail & Related papers (2024-05-30T05:43:09Z)
- Expressive Forecasting of 3D Whole-body Human Motions [38.93700642077312]
We are the first to formulate a whole-body human pose forecasting framework.
Our model involves two key constituents: cross-context alignment (XCA) and cross-context interaction (XCI).
We conduct extensive experiments on a newly-introduced large-scale benchmark and achieve state-of-the-art performance.
arXiv Detail & Related papers (2023-12-19T09:09:46Z)
- Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc.
arXiv Detail & Related papers (2023-07-26T08:40:46Z)
- Deep Learning Technique for Human Parsing: A Survey and Outlook [5.236995853909988]
In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing.
We put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research.
We point out a set of under-investigated open issues in this field and suggest new directions for future study.
arXiv Detail & Related papers (2023-01-01T12:39:57Z)
- Video-based Pose-Estimation Data as Source for Transfer Learning in Human Activity Recognition [71.91734471596433]
Human Activity Recognition (HAR) using on-body devices identifies specific human actions in unconstrained environments.
Previous works demonstrated that transfer learning is a good strategy for addressing scenarios with scarce data.
This paper proposes using datasets intended for human-pose estimation as a source for transfer learning.
arXiv Detail & Related papers (2022-12-02T18:19:36Z)
- MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain [23.598727613908853]
We present MECCANO, a dataset of egocentric videos to study human behavior understanding in industrial-like settings.
The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset.
The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first person view.
arXiv Detail & Related papers (2022-09-19T00:52:42Z)
- Learn to Predict How Humans Manipulate Large-sized Objects from Interactive Motions [82.90906153293585]
We propose a graph neural network, HO-GCN, to fuse motion data and dynamic descriptors for the prediction task.
We show that the proposed network, which consumes dynamic descriptors, achieves state-of-the-art prediction results and generalizes better to unseen objects.
arXiv Detail & Related papers (2022-06-25T09:55:39Z)
- GIMO: Gaze-Informed Human Motion Prediction in Context [75.52839760700833]
We propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, and ego-centric views with eye gaze.
Our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects.
To realize the full potential of gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches.
arXiv Detail & Related papers (2022-04-20T13:17:39Z)
- A Multi-viewpoint Outdoor Dataset for Human Action Recognition [3.522154868524807]
We present a multi-viewpoint outdoor action recognition dataset collected from YouTube and our own drone.
The dataset consists of 20 dynamic human action classes, 2,324 video clips and 503,086 frames.
The overall baseline action recognition accuracy is 74.0%.
arXiv Detail & Related papers (2021-10-07T14:50:43Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.