PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge
- URL: http://arxiv.org/abs/2505.24411v1
- Date: Fri, 30 May 2025 09:51:04 GMT
- Title: PCIE_Pose Solution for EgoExo4D Pose and Proficiency Estimation Challenge
- Authors: Feng Chen, Kanokphan Lertniphonphan, Qiancheng Yan, Xiaohui Fan, Jun Xie, Tao Zhang, Zhepeng Wang
- Abstract summary: This report focuses on the task of estimating 21 3D hand joints from RGB egocentric videos. We developed the Hand Pose Vision Transformer (HP-ViT+) to refine hand pose predictions. For the EgoExo4D Body Pose Challenge, we adopted a multimodal spatio-temporal feature integration strategy. Our methods achieved remarkable performance: 8.31 PA-MPJPE in the Hand Pose Challenge and 11.25 MPJPE in the Body Pose Challenge.
- Score: 26.194108651583466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report introduces our team's (PCIE_EgoPose) solutions for the EgoExo4D Pose and Proficiency Estimation Challenges at CVPR2025. Focused on the intricate task of estimating 21 3D hand joints from RGB egocentric videos, which are complicated by subtle movements and frequent occlusions, we developed the Hand Pose Vision Transformer (HP-ViT+). This architecture synergizes a Vision Transformer and a CNN backbone, using weighted fusion to refine the hand pose predictions. For the EgoExo4D Body Pose Challenge, we adopted a multimodal spatio-temporal feature integration strategy to address the complexities of body pose estimation across dynamic contexts. Our methods achieved remarkable performance: 8.31 PA-MPJPE in the Hand Pose Challenge and 11.25 MPJPE in the Body Pose Challenge, securing championship titles in both competitions. We extended our pose estimation solutions to the Proficiency Estimation task, applying core technologies such as transformer-based architectures. This extension enabled us to achieve a top-1 accuracy of 0.53, a SOTA result, in the Demonstrator Proficiency Estimation competition.
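The abstract describes HP-ViT+ as a Vision Transformer and a CNN backbone whose hand pose predictions are combined by weighted fusion. Below is a minimal PyTorch sketch of that fusion idea; the module names, pooled-feature interface, and learned scalar weight are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: weighted fusion of ViT- and CNN-branch hand pose estimates.
# Backbones are assumed to return pooled feature vectors; all names are
# hypothetical, not the HP-ViT+ codebase.
import torch
import torch.nn as nn

class WeightedFusionHandPose(nn.Module):
    def __init__(self, vit_backbone: nn.Module, cnn_backbone: nn.Module,
                 vit_dim: int, cnn_dim: int, num_joints: int = 21):
        super().__init__()
        self.num_joints = num_joints
        self.vit_backbone = vit_backbone          # e.g. a ViT encoder
        self.cnn_backbone = cnn_backbone          # e.g. a ResNet encoder
        self.vit_head = nn.Linear(vit_dim, num_joints * 3)
        self.cnn_head = nn.Linear(cnn_dim, num_joints * 3)
        # Learnable fusion logit; sigmoid keeps the blend weight in (0, 1).
        self.fusion_logit = nn.Parameter(torch.zeros(1))

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        vit_pose = self.vit_head(self.vit_backbone(img))   # (B, J*3)
        cnn_pose = self.cnn_head(self.cnn_backbone(img))   # (B, J*3)
        w = torch.sigmoid(self.fusion_logit)
        fused = w * vit_pose + (1.0 - w) * cnn_pose        # weighted fusion
        return fused.view(-1, self.num_joints, 3)          # (B, 21, 3)
```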
Related papers
- Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop [120.2806035123366]
The RoboTwin Dual-Arm Collaboration Challenge was held at the 2nd MEIS Workshop, CVPR 2025. Competitors tackled 17 dual-arm manipulation tasks in total, covering rigid, deformable, and tactile-based scenarios. The report outlines the competition setup, task design, evaluation methodology, key findings, and future directions.
arXiv Detail & Related papers (2025-06-29T17:56:41Z) - FRAME: Floor-aligned Representation for Avatar Motion from Egocentric Video [52.33896173943054]
Egocentric motion capture with a head-mounted body-facing stereo camera is crucial for VR and AR applications. Existing methods rely on synthetic pretraining and struggle to generate smooth and accurate predictions in real-world settings. We propose FRAME, a simple yet effective architecture that combines device pose and camera feeds for state-of-the-art body pose prediction.
arXiv Detail & Related papers (2025-03-29T14:26:06Z) - Estimating Body and Hand Motion in an Ego-sensed World [62.61989004520802]
We present EgoAllo, a system for human motion estimation from a head-mounted device. Using only egocentric SLAM poses and images, EgoAllo guides sampling from a conditional diffusion model to estimate 3D body pose, height, and hand parameters.
arXiv Detail & Related papers (2024-10-04T17:59:57Z) - DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image [98.29284902879652]
We present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. It disentangles the regression of local deformation fields and global mesh locations into two network branches. It achieves state-of-the-art performance on a standard benchmark and on in-the-wild data in terms of accuracy and physical plausibility.
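The two-branch disentanglement lends itself to a short sketch: one branch regresses local per-vertex deformation offsets, the other a global placement, and the two are composed onto a template mesh. Every name and dimension below is an assumption for illustration; the actual DICE architecture is more elaborate.

```python
# Illustrative two-branch disentanglement: local deformation field vs.
# global mesh location. Not the DICE implementation; a hypothetical sketch.
import torch
import torch.nn as nn

class TwoBranchDeformation(nn.Module):
    def __init__(self, feat_dim: int, num_verts: int):
        super().__init__()
        self.num_verts = num_verts
        self.local_branch = nn.Linear(feat_dim, num_verts * 3)  # per-vertex offsets
        self.global_branch = nn.Linear(feat_dim, 3)             # rigid translation only

    def forward(self, feat: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        # feat: (B, feat_dim) image features; template: (num_verts, 3) canonical mesh.
        local = self.local_branch(feat).view(-1, self.num_verts, 3)
        trans = self.global_branch(feat).view(-1, 1, 3)
        return template.unsqueeze(0) + local + trans            # (B, num_verts, 3)
```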
arXiv Detail & Related papers (2024-06-26T00:08:29Z) - PCIE_EgoHandPose Solution for EgoExo4D Hand Pose Challenge [12.31892993103657]
The main goal of the challenge is to accurately estimate hand poses, comprising 21 3D joints, from RGB egocentric video.
To handle the complexity of the task, we propose the Hand Pose Vision Transformer (HP-ViT).
HP-ViT comprises a ViT backbone and a transformer head that estimate joint positions in 3D, trained with MPJPE and RLE loss functions.
Our approach achieved the 1st position in the Hand Pose challenge with 25.51 MPJPE and 8.49 PA-MPJPE.
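As a hedged illustration of the MPJPE-plus-RLE objective, the sketch below pairs standard MPJPE with a Laplace negative log-likelihood as a simplified stand-in; true RLE models the residual distribution with a normalizing flow, which is omitted here.

```python
# Simplified training objective: MPJPE plus a Laplace NLL proxy for RLE.
# The Laplace term is an assumption standing in for the flow-based RLE loss.
import torch

def mpjpe_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred, gt: (B, 21, 3); mean per-joint Euclidean distance.
    return (pred - gt).norm(dim=-1).mean()

def laplace_nll(pred, gt, log_sigma):
    # log_sigma: (B, 21, 1) predicted per-joint scale; NLL up to a constant.
    return (torch.abs(pred - gt) / log_sigma.exp() + log_sigma).mean()

def total_loss(pred, gt, log_sigma, w_nll: float = 0.1):
    return mpjpe_loss(pred, gt) + w_nll * laplace_nll(pred, gt, log_sigma)
```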
arXiv Detail & Related papers (2024-06-18T02:41:32Z) - EgoPoseFormer: A Simple Baseline for Stereo Egocentric 3D Human Pose Estimation [15.590340765703893]
We present EgoPoseFormer, a transformer-based model for stereo egocentric human pose estimation.
Our approach addresses the main challenge of joint invisibility caused by self-occlusion or the limited field of view (FOV) of head-mounted cameras.
We evaluate our method on the stereo UnrealEgo dataset and show it significantly outperforms previous approaches.
arXiv Detail & Related papers (2024-03-26T20:02:48Z) - Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
Holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition, and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z) - Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based
Motion Refinement [65.08165593201437]
We explore egocentric whole-body motion capture using a single fisheye camera, which simultaneously estimates human body and hand motion.
This task presents significant challenges due to the lack of high-quality datasets, fisheye camera distortion, and human body self-occlusion.
We propose a novel approach that leverages FisheyeViT to extract fisheye image features, which are converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction.
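A common way to decode such 3D heatmaps into coordinates is a differentiable soft-argmax; whether FisheyeViT decodes its pixel-aligned heatmaps exactly this way is an assumption, but the general mechanism looks roughly like this:

```python
# Soft-argmax decoding of 3D heatmaps into normalized (x, y, z) joints.
# A generic sketch, not FisheyeViT's actual decoder.
import torch

def soft_argmax_3d(heatmaps: torch.Tensor) -> torch.Tensor:
    # heatmaps: (B, J, D, H, W) unnormalized per-joint scores.
    b, j, d, h, w = heatmaps.shape
    probs = heatmaps.view(b, j, -1).softmax(dim=-1).view(b, j, d, h, w)
    zs = torch.linspace(0, 1, d, device=heatmaps.device)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    # Expected coordinate along each axis, marginalizing the other two.
    z = (probs.sum(dim=(3, 4)) * zs).sum(dim=-1)   # (B, J)
    y = (probs.sum(dim=(2, 4)) * ys).sum(dim=-1)
    x = (probs.sum(dim=(2, 3)) * xs).sum(dim=-1)
    return torch.stack([x, y, z], dim=-1)           # (B, J, 3) in [0, 1]
```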
arXiv Detail & Related papers (2023-11-28T07:13:47Z) - 1st Place Solution of Egocentric 3D Hand Pose Estimation Challenge 2023
Technical Report: A Concise Pipeline for Egocentric Hand Pose Reconstruction [11.551318550321938]
Using AssemblyHands, this challenge focuses on egocentric 3D hand pose estimation from a single-view image.
We adopt ViT-based backbones and a simple regressor for 3D keypoint prediction, which provides strong model baselines.
Our method achieved 12.21 mm MPJPE on the test dataset, taking first place in the Egocentric 3D Hand Pose Estimation challenge.
arXiv Detail & Related papers (2023-10-07T10:25:50Z) - Building Spatio-temporal Transformers for Egocentric 3D Pose Estimation [9.569752078386006]
We leverage information from past frames to guide our self-attention-based 3D estimation procedure -- Ego-STAN.
Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network feature maps.
We demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset.
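The general pattern of attending over per-frame CNN feature maps can be sketched as below; the token layout, pooling, and regression head are assumptions for illustration, not Ego-STAN's exact design.

```python
# Spatio-temporal Transformer over per-frame CNN feature maps (sketch).
import torch
import torch.nn as nn

class SpatioTemporalPose(nn.Module):
    def __init__(self, cnn: nn.Module, feat_dim: int, num_joints: int = 15,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.cnn = cnn  # assumed to map (N, C, H, W) -> (N, feat_dim, h, w)
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(feat_dim, num_joints * 3)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clip.shape                     # (B, T, C, H, W) frames
        fmap = self.cnn(clip.view(b * t, c, h, w))     # per-frame feature maps
        tokens = fmap.flatten(2).transpose(1, 2)       # (B*T, h*w, feat_dim)
        tokens = tokens.reshape(b, -1, fmap.shape[1])  # (B, T*h*w, feat_dim)
        ctx = self.encoder(tokens)                     # joint space-time attention
        return self.head(ctx.mean(dim=1)).view(b, -1, 3)  # (B, J, 3)
```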
arXiv Detail & Related papers (2022-06-09T22:33:27Z)