Coordinate Transformer: Achieving Single-stage Multi-person Mesh
Recovery from Videos
- URL: http://arxiv.org/abs/2308.10334v1
- Date: Sun, 20 Aug 2023 18:23:07 GMT
- Title: Coordinate Transformer: Achieving Single-stage Multi-person Mesh
Recovery from Videos
- Authors: Haoyuan Li, Haoye Dong, Hanchao Jia, Dong Huang, Michael C.
Kampffmeyer, Liang Lin, Xiaodan Liang
- Abstract summary: Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond.
We propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner.
Experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively.
- Score: 91.44553585470688
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-person 3D mesh recovery from videos is a critical first step towards
automatic perception of group behavior in virtual reality, physical therapy and
beyond. However, existing approaches rely on multi-stage paradigms, where the
person detection and tracking stages are performed in a multi-person setting,
while temporal dynamics are only modeled for one person at a time.
Consequently, their performance is severely limited by the lack of inter-person
interactions in the spatial-temporal mesh recovery, as well as by detection and
tracking defects. To address these challenges, we propose the Coordinate
transFormer (CoordFormer) that directly models multi-person spatial-temporal
relations and simultaneously performs multi-mesh recovery in an end-to-end
manner. Instead of partitioning the feature map into coarse-scale patch-wise
tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve
pixel-level spatial-temporal coordinate information. Additionally, we propose a
simple, yet effective Body Center Attention mechanism to fuse position
information. Extensive experiments on the 3DPW dataset demonstrate that
CoordFormer significantly improves the state-of-the-art, outperforming the
previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE,
and PVE metrics, respectively, while being 40% faster than recent video-based
approaches. The released code can be found at
https://github.com/Li-Hao-yuan/CoordFormer.
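The abstract's key idea, keeping pixel-level spatial-temporal coordinates attached to attention tokens rather than collapsing the feature map into coarse patches, can be sketched roughly as follows. This is an illustrative NumPy sketch only, not the paper's Coordinate-Aware Attention implementation; all shapes, projection matrices, and the coordinate-embedding scheme are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def coordinate_aware_attention(feats, coords, d_model=32, rng=None):
    """feats: (N, C) per-pixel features; coords: (N, 3) normalized (x, y, t).

    Sketch: project each pixel's (x, y, t) coordinates into feature space
    and add them to the token, so spatial-temporal position information
    survives the attention mixing instead of being lost to patch pooling.
    """
    rng = rng or np.random.default_rng(0)
    C = feats.shape[1]
    Wc = rng.standard_normal((3, C)) * 0.1       # coordinate embedding (assumed)
    tokens = feats + coords @ Wc
    Wq = rng.standard_normal((C, d_model)) * 0.1
    Wk = rng.standard_normal((C, d_model)) * 0.1
    Wv = rng.standard_normal((C, d_model)) * 0.1
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d_model))    # (N, N) attention weights
    return attn @ v                               # (N, d_model) mixed features

# Tiny example: 6 "pixels" with 8-dim features spread over 2 frames.
feats = np.ones((6, 8))
coords = np.stack([np.linspace(0, 1, 6),          # x
                   np.linspace(0, 1, 6),          # y
                   np.repeat([0.0, 1.0], 3)],     # t (frame index)
                  axis=1)
out = coordinate_aware_attention(feats, coords)
print(out.shape)  # (6, 32)
```

The point of the sketch is the `coords @ Wc` term: because every token carries its own (x, y, t) embedding, tokens that are identical in appearance (here, all-ones features) still attend differently depending on where and when they occur.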
Related papers
- Enhancing 3D Human Pose Estimation Amidst Severe Occlusion with Dual Transformer Fusion [13.938406073551844]
This paper introduces the Dual Transformer Fusion (DTF) algorithm, a novel approach to holistic 3D pose estimation.
To enable precise 3D Human Pose Estimation, our approach leverages the innovative DTF architecture, which first generates a pair of intermediate views.
Our approach outperforms existing state-of-the-art methods on both datasets, yielding substantial improvements.
arXiv Detail & Related papers (2024-10-06T18:15:27Z)
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction [106.06256351200068]
This paper introduces a model learning framework with auxiliary tasks.
In our auxiliary tasks, partial body joints' coordinates are corrupted by either masking or adding noise.
We propose a novel auxiliary-adapted transformer, which can handle incomplete, corrupted motion data.
arXiv Detail & Related papers (2023-08-17T12:26:11Z)
- Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
Gait recognition in the wild is a more practical problem that has attracted attention from the multimedia and computer vision communities.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
- A Dual-Masked Auto-Encoder for Robust Motion Capture with Spatial-Temporal Skeletal Token Completion [13.88656793940129]
We propose an adaptive, identity-aware triangulation module to reconstruct 3D joints and identify the missing joints for each identity.
We then propose a Dual-Masked Auto-Encoder (D-MAE) which encodes the joint status with both skeletal-structural and temporal position encoding for trajectory completion.
In order to demonstrate the proposed model's capability in dealing with severe data loss scenarios, we contribute a high-accuracy and challenging motion capture dataset.
arXiv Detail & Related papers (2022-07-15T10:00:43Z)
- P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation [78.83305967085413]
This paper introduces a novel Pre-trained Spatial Temporal Many-to-One (P-STMO) model for 2D-to-3D human pose estimation task.
Our method outperforms state-of-the-art methods with fewer parameters and less computational overhead.
arXiv Detail & Related papers (2022-03-15T04:00:59Z)
- Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images.
MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks.
We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)
- Monocular, One-stage, Regression of Multiple 3D People [105.3143785498094]
We propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP).
Our method simultaneously predicts a Body Center heatmap and a Mesh map, which can jointly describe the 3D body mesh on the pixel level.
Compared with state-of-the-art methods, ROMP achieves superior performance on challenging multi-person benchmarks.
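The pairing of a Body Center heatmap with a per-pixel Mesh map can be sketched as a simple readout: find local peaks in the center heatmap, then sample the mesh-parameter map at those pixels. This is a hedged illustration of the general idea; the map sizes, number of parameters (here 10), and peak threshold are assumptions, not ROMP's actual configuration.

```python
import numpy as np

def read_out_meshes(center_heatmap, mesh_map, threshold=0.5):
    """center_heatmap: (H, W) scores; mesh_map: (H, W, P) parameter vectors.

    Returns [(center_yx, params), ...] with one entry per detected person.
    """
    H, W = center_heatmap.shape
    people = []
    for y in range(H):
        for x in range(W):
            v = center_heatmap[y, x]
            if v < threshold:
                continue
            # Keep only 3x3 local maxima so one person yields one detection.
            patch = center_heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if v >= patch.max():
                people.append(((y, x), mesh_map[y, x]))
    return people

# Two synthetic "people": heatmap peaks at (2, 3) and (6, 1).
heat = np.zeros((8, 8))
heat[2, 3] = 0.9
heat[6, 1] = 0.8
mesh = np.arange(8 * 8 * 10, dtype=float).reshape(8, 8, 10)
people = read_out_meshes(heat, mesh)
print([c for c, _ in people])  # [(2, 3), (6, 1)]
```

Because detection and parameter regression share the same pixel grid, each detected center directly indexes that person's mesh parameters, which is what makes the approach single-stage: there is no separate crop-and-regress step per person.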
arXiv Detail & Related papers (2020-08-27T17:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.