Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation
- URL: http://arxiv.org/abs/2206.04785v1
- Date: Thu, 9 Jun 2022 22:33:27 GMT
- Title: Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation
- Authors: Jinman Park, Kimathi Kaai, Saad Hossain, Norikatsu Sumi, Sirisha
Rambhatla, Paul Fieguth
- Abstract summary: We leverage information from past frames to guide our self-attention-based 3D estimation procedure -- Ego-STAN.
Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps.
We demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset.
- Score: 9.569752078386006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric 3D human pose estimation (HPE) from images is challenging due to
severe self-occlusions and strong distortion introduced by the fish-eye view
from the head mounted camera. Although existing works use intermediate
heatmap-based representations to counter distortion with some success,
addressing self-occlusion remains an open problem. In this work, we leverage
information from past frames to guide our self-attention-based 3D HPE
estimation procedure -- Ego-STAN. Specifically, we build a spatio-temporal
Transformer model that attends to semantically rich convolutional neural
network-based feature maps. We also propose feature map tokens: a new set of
learnable parameters to attend to these feature maps. Finally, we demonstrate
Ego-STAN's superior performance on the xR-EgoPose dataset where it achieves a
30.6% improvement on the overall mean per-joint position error, while leading
to a 22% drop in parameters compared to the state-of-the-art.
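
The headline metric in the abstract, mean per-joint position error (MPJPE), is conventionally the average Euclidean distance between each predicted 3D joint and its ground-truth counterpart. The snippet below is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `mpjpe` and the toy joint coordinates are assumptions made for illustration, and any alignment or unit conventions used by the xR-EgoPose benchmark are not modeled.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: equal-length sequences of (x, y, z) joint positions
    (hypothetical inputs; units are whatever the caller uses).
    Returns the mean Euclidean distance over all joints.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    # math.dist computes the Euclidean distance between two points.
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Toy example: one joint is exact, the other is off by 5 units.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
print(mpjpe(pred, gt))  # 2.5
```

Lower values are better, so an "improvement" on MPJPE means a reduction in this average distance.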
Related papers
- Estimating Body and Hand Motion in an Ego-sensed World [64.08911275906544]
We present EgoAllo, a system for human motion estimation from a head-mounted device.
Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters.
(2024-10-04T17:59:57Z)
- Frequency-based View Selection in Gaussian Splatting Reconstruction [9.603843571051744]
We investigate the problem of active view selection to perform 3D Gaussian Splatting reconstructions with as few input images as possible.
By ranking the potential views in the frequency domain, we are able to effectively estimate the potential information gain of new viewpoints.
Our method achieves state-of-the-art results in view selection, demonstrating its potential for efficient image-based 3D reconstruction.
(2024-09-24T21:44:26Z)
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception.
Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest image resolution and the most lightweight image backbone.
(2024-05-17T07:31:20Z)
- Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting [8.134443548271301]
We present EgoTAP, a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation.
Our method significantly outperforms the previous state-of-the-art qualitatively and quantitatively.
(2024-02-28T13:50:39Z)
- SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras [6.476948781728137]
Egocentric human pose estimation is difficult from downwards-facing cameras on head-mounted devices (HMDs).
Previous solutions minimize this problem by using fish-eye camera lenses to capture a wider view, but these can present hardware design issues.
We predict pose from images captured with conventional rectilinear camera lenses. This resolves hardware design issues, but means body parts are often out of frame.
Our approach achieves state-of-the-art results for this challenging configuration, reducing mean per-joint position error by 23% overall and 58% for the lower body.
(2024-01-26T11:19:13Z)
- 1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023 Technical Report: A Concise Pipeline for Egocentric Hand Pose Reconstruction [11.551318550321938]
Using AssemblyHands, this challenge focuses on egocentric 3D hand pose estimation from a single-view image.
We adopt ViT-based backbones and a simple regressor for 3D keypoint prediction, which provides strong model baselines.
Our method achieved 12.21 mm MPJPE on the test dataset, taking first place in the Egocentric 3D Hand Pose Estimation challenge.
(2023-10-07T10:25:50Z)
- Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views [9.476008200056082]
Ego3DPose is a highly accurate binocular egocentric 3D pose reconstruction system.
We propose a two-path network architecture with a path that estimates pose per limb independently with its binocular heatmaps.
We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs.
(2023-09-21T10:34:35Z)
- Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting [79.34357055254239]
Hand trajectory forecasting is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems.
Existing methods handle this problem in 2D image space, which is inadequate for 3D real-world applications.
We set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view.
(2023-07-17T04:55:02Z)
- RiCS: A 2D Self-Occlusion Map for Harmonizing Volumetric Objects [68.85305626324694]
Ray-marching in Camera Space (RiCS) is a new method to represent the self-occlusions of foreground objects in 3D into a 2D self-occlusion map.
We show that our representation map not only allows us to enhance the image quality but also to model temporally coherent complex shadow effects.
(2022-05-14T05:35:35Z)
- Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
(2022-04-06T17:54:46Z)
- SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera [97.0162841635425]
We present a solution to egocentric 3D body pose estimation from monocular images captured from downward looking fish-eye cameras installed on the rim of a head mounted VR device.
This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions.
We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions.
(2020-11-02T16:18:06Z)