DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation
- URL: http://arxiv.org/abs/2203.08713v1
- Date: Wed, 16 Mar 2022 16:03:37 GMT
- Title: DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation
- Authors: Ailing Zeng, Xuan Ju, Lei Yang, Ruiyuan Gao, Xizhou Zhu, Bo Dai, Qiang
Xu
- Abstract summary: DeciWatch is a framework for video-based 2D/3D human pose estimation.
It can achieve 10 times efficiency improvement over existing works without any performance degradation.
- Score: 58.06505882939195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper proposes a simple baseline framework for video-based 2D/3D human
pose estimation that can achieve 10 times efficiency improvement over existing
works without any performance degradation, named DeciWatch. Unlike current
solutions that estimate each frame in a video, DeciWatch introduces a simple
yet effective sample-denoise-recover framework that only watches sparsely
sampled frames, taking advantage of the continuity of human motions and the
lightweight pose representation. Specifically, DeciWatch uniformly samples less
than 10% video frames for detailed estimation, denoises the estimated 2D/3D
poses with an efficient Transformer architecture, and then accurately recovers
the rest of the frames using another Transformer-based network. Comprehensive
experimental results on three video-based human pose estimation and body mesh
recovery tasks with four datasets validate the efficiency and effectiveness of
DeciWatch.
Related papers
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes.
By simply estimating a pointmap for each timestep, we can effectively adapt DUST3R's representation, previously only used for static scenes, to dynamic scenes.
We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z) - Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction [153.52406455209538]
Gamba is an end-to-end 3D reconstruction model from a single-view image.
It completes reconstruction within 0.05 seconds on a single NVIDIA A100 GPU.
arXiv Detail & Related papers (2024-03-27T17:40:14Z) - Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoDT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, resulting in a few pose tokens in the intermediate transformer blocks.
Our method can achieve both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - Tracking by 3D Model Estimation of Unknown Objects in Videos [122.56499878291916]
We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation.
Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames.
The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose.
arXiv Detail & Related papers (2023-04-13T11:32:36Z) - Kinematic-aware Hierarchical Attention Network for Human Pose Estimation
in Videos [17.831839654593452]
Previous-based human pose estimation methods have shown promising results by leveraging features of consecutive frames.
Most approaches compromise accuracy to jitter and do not comprehend the temporal aspects of human motion.
We design an architecture that exploits kinematic keypoint features.
arXiv Detail & Related papers (2022-11-29T01:46:11Z) - Uplift and Upsample: Efficient 3D Human Pose Estimation with Uplifting
Transformers [28.586258731448687]
We present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences.
We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks.
We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP.
arXiv Detail & Related papers (2022-10-12T12:00:56Z) - VideoPose: Estimating 6D object pose from videos [14.210010379733017]
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos.
Our proposed network takes a pre-trained 2D object detector as input, and aggregates visual features through a recurrent neural network to make predictions at each frame.
Experimental evaluation on the YCB-Video dataset show that our approach is on par with the state-of-the-art algorithms.
arXiv Detail & Related papers (2021-11-20T20:57:45Z) - Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them, however, the probability of effective samples is relatively small in the 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3d parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z) - Light3DPose: Real-time Multi-Person 3D PoseEstimation from Multiple
Views [5.510992382274774]
We present an approach to perform 3D pose estimation of multiple people from a few calibrated camera views.
Our architecture aggregates feature-maps from a 2D pose estimator backbone into a comprehensive representation of the 3D scene.
The proposed method is inherently efficient: as a pure bottom-up approach, it is computationally independent of the number of people in the scene.
arXiv Detail & Related papers (2020-04-06T14:12:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.