Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose
- URL: http://arxiv.org/abs/2509.16557v1
- Date: Sat, 20 Sep 2025 07:27:32 GMT
- Title: Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose
- Authors: Muhammad Hamza, Danish Hamid, Muhammad Tahir Akram
- Abstract summary: This research introduces I2S, a framework designed for unobtrusive user identification through human-object interaction recognition. I2S utilizes handcrafted features extracted from 3D hand poses and performs sequential feature augmentation. I2S demonstrates state-of-the-art performance while maintaining a lightweight model size of under 4 MB and a fast inference time of 0.1 seconds.
- Score: 0.4779196219827507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-Object Interaction Recognition (HOIR) and user identification play a crucial role in advancing augmented reality (AR)-based personalized assistive technologies. These systems are increasingly being deployed in high-stakes, human-centric environments such as aircraft cockpits, aerospace maintenance, and surgical procedures. This research introduces I2S (Interact2Sign), a multi-stage framework designed for unobtrusive user identification through human-object interaction recognition, leveraging 3D hand pose analysis in egocentric videos. I2S utilizes handcrafted features extracted from 3D hand poses and performs sequential feature augmentation: first identifying the object class, followed by HOI recognition, and ultimately, user identification. A comprehensive feature extraction and description process was carried out for 3D hand poses, organizing the extracted features into semantically meaningful categories: Spatial, Frequency, Kinematic, Orientation, and a novel descriptor introduced in this work, the Inter-Hand Spatial Envelope (IHSE). Extensive ablation studies were conducted to determine the most effective combination of features. The optimal configuration achieved an impressive average F1-score of 97.52% for user identification, evaluated on a bimanual object manipulation dataset derived from the ARCTIC and H2O datasets. I2S demonstrates state-of-the-art performance while maintaining a lightweight model size of under 4 MB and a fast inference time of 0.1 seconds. These characteristics make the proposed framework highly suitable for real-time, on-device authentication in security-critical, AR-based systems.
Related papers
- SAM 3D Body: Robust Full-Body Human Mesh Recovery [65.0108906331903]
We introduce SAM 3D Body (3DB), a promptable model for single-image full-body 3D human mesh recovery (HMR).
3DB estimates the human pose of the body, feet, and hands.
It is the first model to use a new parametric mesh representation, Momentum Human Rig (MHR), which decouples skeletal structure and surface shape.
arXiv Detail & Related papers (2026-02-17T20:26:37Z) - InterAct: Advancing Large-Scale Versatile 3D Human-Object Interaction Generation [54.09384502044162]
We introduce InterAct, a large-scale 3D HOI benchmark featuring dataset and methodological advancements.
First, we consolidate and standardize 21.81 hours of HOI data from diverse sources, enriching it with detailed textual annotations.
Second, we propose a unified optimization framework to enhance data quality by reducing artifacts and correcting hand motions.
Third, we define six benchmarking tasks and develop a unified HOI generative modeling perspective, achieving state-of-the-art performance.
arXiv Detail & Related papers (2025-09-11T15:43:54Z) - IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network [2.223052975765005]
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction.
The system represents the 2D or 3D spatial relation of human and objects from the detection results in video data as a graph.
We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
arXiv Detail & Related papers (2024-10-10T13:39:17Z) - HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes [10.237077867790612]
We present HOIMotion, a novel approach for human motion forecasting during human-object interactions.
Our method integrates information about past body poses and egocentric 3D object bounding boxes.
We show that HOIMotion consistently outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2024-07-02T19:58:35Z) - In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition [1.4732811715354455]
Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort.
Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor.
We introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective.
arXiv Detail & Related papers (2024-04-14T17:33:33Z) - HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields [96.04424738803667]
HOISDF is a guided hand-object pose estimation network.
It exploits hand and object SDFs to provide a global, implicit representation over the complete reconstruction volume.
We show that HOISDF achieves state-of-the-art results on hand-object pose estimation benchmarks.
arXiv Detail & Related papers (2024-02-26T22:48:37Z) - Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition [45.0131792009999]
We propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition.
Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information.
Our network outperforms state-of-the-art approaches in most standard evaluation settings.
arXiv Detail & Related papers (2023-07-22T03:51:32Z) - Human Action Recognition in Egocentric Perspective Using 2D Object and Hands Pose [2.0305676256390934]
Egocentric action recognition is essential for healthcare and assistive technology that relies on egocentric cameras.
This study explores the feasibility of using 2D hand and object pose information for egocentric action recognition.
arXiv Detail & Related papers (2023-06-08T12:15:16Z) - Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos [50.74218823358754]
We develop a transformer-based framework to exploit temporal information for robust estimation.
We build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation.
Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O.
arXiv Detail & Related papers (2022-09-20T05:52:54Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation of TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z) - Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos [8.571131862820833]
State-of-the-art pose estimators struggle to obtain high-quality 2D or 3D pose data due to truncation and low resolution in real-world, unannotated videos.
We propose a Selective Spatio-Temporal Aggregation mechanism, named SST-A, that refines and smooths the keypoint locations extracted by multiple expert pose estimators.
We demonstrate that the skeleton data refined by our Pose-Refinement system (SSTA-PRS) is effective at boosting various existing action recognition models.
arXiv Detail & Related papers (2020-11-10T19:19:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.