Uncertainty-aware State Space Transformer for Egocentric 3D Hand
Trajectory Forecasting
- URL: http://arxiv.org/abs/2307.08243v2
- Date: Sun, 17 Sep 2023 02:40:01 GMT
- Title: Uncertainty-aware State Space Transformer for Egocentric 3D Hand
Trajectory Forecasting
- Authors: Wentao Bao, Lele Chen, Libing Zeng, Zhong Li, Yi Xu, Junsong Yuan, Yu Kong
- Abstract summary: Hand trajectory forecasting is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems.
Existing methods handle this problem in a 2D image space which is inadequate for 3D real-world applications.
We set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view.
- Score: 79.34357055254239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hand trajectory forecasting from egocentric views is crucial for enabling a
prompt understanding of human intentions when interacting with AR/VR systems.
However, existing methods handle this problem in a 2D image space which is
inadequate for 3D real-world applications. In this paper, we set up an
egocentric 3D hand trajectory forecasting task that aims to predict hand
trajectories in a 3D space from early observed RGB videos in a first-person
view. To fulfill this goal, we propose an uncertainty-aware state space
Transformer (USST) that takes the merits of the attention mechanism and
aleatoric uncertainty within the framework of the classical state-space model.
The model can be further enhanced by the velocity constraint and visual prompt
tuning (VPT) on large vision transformers. Moreover, we develop an annotation
workflow to collect 3D hand trajectories with high quality. Experimental
results on H2O and EgoPAT3D datasets demonstrate the superiority of USST for
both 2D and 3D trajectory forecasting. The code and datasets are publicly
released: https://actionlab-cv.github.io/EgoHandTrajPred.
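The abstract names two training ingredients, aleatoric uncertainty and a velocity constraint, without giving formulas. As a rough illustration only, the sketch below shows one common way such terms are combined: a heteroscedastic Gaussian negative log-likelihood on predicted 3D positions plus a penalty on finite-difference velocities. This is not the USST implementation; the function names, the `velocity_weight` parameter, and the exact loss form are assumptions made for this example.

```python
import math


def gaussian_nll(mean, log_var, target):
    """Heteroscedastic Gaussian NLL for one coordinate (constant term dropped).

    A large predicted log-variance down-weights the squared error but is
    itself penalised, which is the usual aleatoric-uncertainty trade-off.
    """
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))


def forecasting_loss(pred_means, pred_log_vars, targets, velocity_weight=0.1):
    """Hypothetical trajectory loss: mean NLL plus a velocity-consistency term.

    Each argument is a list of (x, y, z) tuples, one per future time step.
    """
    # Average the per-coordinate NLL over all steps and dimensions.
    nll_terms = [
        gaussian_nll(m, s, t)
        for step_m, step_s, step_t in zip(pred_means, pred_log_vars, targets)
        for m, s, t in zip(step_m, step_s, step_t)
    ]
    nll = sum(nll_terms) / len(nll_terms)

    # Compare finite-difference velocities of prediction and ground truth.
    vel_terms = []
    for k in range(1, len(targets)):
        for d in range(len(targets[k])):
            pred_vel = pred_means[k][d] - pred_means[k - 1][d]
            true_vel = targets[k][d] - targets[k - 1][d]
            vel_terms.append((pred_vel - true_vel) ** 2)
    vel = sum(vel_terms) / len(vel_terms) if vel_terms else 0.0

    return nll + velocity_weight * vel
```

In this sketch a prediction that is offset from the ground truth by a constant incurs an NLL cost but no velocity cost, since its step-to-step motion still matches; the velocity term only penalises errors in how the trajectory moves.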
Related papers
- Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation [32.50849425431012]
For autonomous cars equipped with multi-camera and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions.
Recent methods are mainly built on the 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space.
In this work, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence.
arXiv Detail & Related papers (2024-11-19T02:40:42Z)
- WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild [53.288327629960364]
We present a data-driven pipeline for efficient multi-hand reconstruction in the wild.
The proposed pipeline is composed of two components: a real-time fully convolutional hand localization and a high-fidelity transformer-based 3D hand reconstruction model.
Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks.
arXiv Detail & Related papers (2024-09-18T18:46:51Z)
- Real-time 3D semantic occupancy prediction for autonomous vehicles using memory-efficient sparse convolution [4.204990010424084]
In autonomous vehicles, understanding the surrounding 3D environment of the ego vehicle in real-time is essential.
State-of-the-art 3D mapping methods leverage transformers with cross-attention mechanisms to elevate 2D vision-centric camera features into the 3D domain.
This paper introduces an approach that extracts features from front-view 2D camera images and LiDAR scans, then employs a sparse convolution network (Minkowski Engine) for 3D semantic occupancy prediction.
arXiv Detail & Related papers (2024-03-13T17:50:59Z)
- Unified Spatio-Temporal Tri-Perspective View Representation for 3D Semantic Occupancy Prediction [6.527178779672975]
This study introduces S2TPVFormer, an architecture for temporally coherent 3D semantic occupancy prediction.
We enrich the prior process by including temporal cues using a novel temporal cross-view hybrid attention mechanism.
Experimental evaluations demonstrate a substantial 4.1% improvement in mean Intersection over Union for 3D Semantic Occupancy.
arXiv Detail & Related papers (2024-01-24T20:06:59Z)
- Social-Transmotion: Promptable Human Trajectory Prediction [65.80068316170613]
Social-Transmotion is a generic Transformer-based model that exploits diverse and numerous visual cues to predict human behavior.
Our approach is validated on multiple datasets, including JTA, JRDB, Pedestrians and Cyclists in Road Traffic, and ETH-UCY.
arXiv Detail & Related papers (2023-12-26T18:56:49Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
- Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation [18.964403296437027]
Act3D represents the robot's workspace using a 3D feature field with adaptive resolutions dependent on the task at hand.
It samples 3D point grids in a coarse to fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling.
arXiv Detail & Related papers (2023-06-30T17:34:06Z)
- T3VIP: Transformation-based 3D Video Prediction [49.178585201673364]
We propose a 3D video prediction (T3VIP) approach that explicitly models the 3D motion by decomposing a scene into its object parts.
Our model is fully unsupervised, captures the nature of the real world, and the observational cues in image and point cloud domains constitute its learning signals.
To the best of our knowledge, our model is the first generative model that provides an RGB-D video prediction of the future for a static camera.
arXiv Detail & Related papers (2022-09-19T15:01:09Z)
- Occlusion Guided Self-supervised Scene Flow Estimation on 3D Point Clouds [4.518012967046983]
Understanding the flow in 3D space of sparsely sampled points between two consecutive time frames is the cornerstone of modern geometric-driven systems.
This work presents a new self-supervised training method and an architecture for the 3D scene flow estimation under occlusions.
arXiv Detail & Related papers (2021-04-10T09:55:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.