Related papers: A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains

URL: http://arxiv.org/abs/2507.13326v1
Date: Thu, 17 Jul 2025 17:45:09 GMT
Title: A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains
Authors: Antonio Finocchiaro, Alessandro Sebastiano Catinello, Michele Mazzamuto, Rosario Leonardi, Antonino Furnari, Giovanni Maria Farinella
Abstract summary: We propose an efficient approach for detecting hand-object interactions from streaming egocentric vision. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction.
Score: 48.42136244433369
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hand-object interaction detection remains an open challenge in real-time applications, where intuitive user experiences depend on fast and accurate detection of interactions with surrounding objects. We propose an efficient approach for detecting hand-object interactions from streaming egocentric vision that operates in real time. Our approach consists of an action recognition module and an object detection module for identifying active objects upon confirmed interaction. Our Mamba model with an EfficientNetV2 backbone for action recognition achieves 38.52% p-AP on the ENIGMA-51 benchmark at 30 fps, while our fine-tuned YOLOWorld reaches 85.13% AP for hand and object. We implement our models in a cascaded architecture where the action recognition and object detection modules operate sequentially. When the action recognition module predicts a contact state, it activates the object detection module, which in turn performs inference on the relevant frame to detect and classify the active object.
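The cascaded control flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `cascade`, `predict_contact`, `detect_objects`, and `Detection` are hypothetical, the two models are replaced by stub functions, and the contact predictor is treated as a per-frame function, whereas the paper's action recognition module works on a video stream. The only point shown is that the detector runs solely on frames where contact is predicted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Detection:
    """Hypothetical container for one detected hand or object."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cascade(
    frames: Iterable[dict],
    predict_contact: Callable[[dict], bool],
    detect_objects: Callable[[dict], List[Detection]],
) -> Iterator[Tuple[int, Optional[List[Detection]]]]:
    """Yield (frame_index, detections); detections is None when no contact
    is predicted, so the (more expensive) detector is skipped for that frame."""
    for index, frame in enumerate(frames):
        if predict_contact(frame):
            yield index, detect_objects(frame)
        else:
            yield index, None


def run_demo() -> Tuple[List[Tuple[int, Optional[List[Detection]]]], int]:
    """Run the cascade on four fake frames with stub models."""
    frames = [{"contact": False}, {"contact": True},
              {"contact": False}, {"contact": True}]
    detector_calls = []

    def predict_contact(frame: dict) -> bool:
        # Stub standing in for the action recognition module.
        return frame["contact"]

    def detect_objects(frame: dict) -> List[Detection]:
        # Stub standing in for the object detection module.
        detector_calls.append(frame)
        return [Detection("tool", 0.9, (0.0, 0.0, 10.0, 10.0))]

    results = list(cascade(frames, predict_contact, detect_objects))
    return results, len(detector_calls)


if __name__ == "__main__":
    results, calls = run_demo()
    print(f"detector invoked on {calls} of {len(results)} frames")
```

In this sketch the detector stub is called on two of the four frames, which is the saving a cascade of this kind aims for when contact is rare.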

Related papers

Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [57.942404069484134]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but open-loop approaches often overlook the interaction dynamics. We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
Uncertainty-Guided Appearance-Motion Association Network for Out-of-Distribution Action Detection [4.938957922033169]
Out-of-distribution (OOD) detection aims to detect and reject test samples with semantic shifts. We propose a novel Uncertainty-Guided Appearance-Motion Association Network (UAAN). We show that UAAN beats state-of-the-art methods by a significant margin, illustrating its effectiveness.
arXiv Detail & Related papers (2024-09-16T02:53:49Z)
Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework. It simultaneously performs Detection And Interaction Reasoning in one stage. We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
SeMoLi: What Moves Together Belongs Together [51.72754014130369]
We tackle semi-supervised object detection based on motion cues. Recent results suggest that motion-based clustering methods can be used to pseudo-label instances of moving objects. We rethink this approach and suggest that both object detection and motion-inspired pseudo-labeling can be tackled in a data-driven manner.
arXiv Detail & Related papers (2024-02-29T18:54:53Z)
Object-Centric Multiple Object Tracking [124.30650395969126]
This paper proposes a video object-centric model for multiple-object tracking pipelines. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module. Benefiting from object-centric learning, we only require sparse detection labels for object localization and feature binding.
arXiv Detail & Related papers (2023-09-01T03:34:12Z)
Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data. Our method achieves performance that is the best or competitive with state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
Sequential Decision-Making for Active Object Detection from Hand [43.839322860501596]
A key component of understanding hand-object interactions is the ability to identify the active object. We set up our active object detection method as a sequential decision-making process conditioned on the location and appearance of the hands. The key innovation of our approach is the design of the active object detection policy, which uses an internal representation called the Box Field.
arXiv Detail & Related papers (2021-10-21T23:40:45Z)
Object-Driven Active Mapping for More Accurate Object Pose Estimation and Robotic Grasping [5.385583891213281]
The framework is built on an object SLAM system integrated with a simultaneous multi-object pose estimation process. By combining the mapping module and the exploration strategy, an accurate object map that is compatible with robotic grasping can be generated.
arXiv Detail & Related papers (2020-12-03T09:36:55Z)
Slender Object Detection: Diagnoses and Improvements [74.40792217534]
In this paper, we are concerned with the detection of a particular type of object with extreme aspect ratios, namely slender objects. For a classical object detection method, a drastic drop of 18.9% mAP on COCO is observed if it is evaluated solely on slender objects.
arXiv Detail & Related papers (2020-11-17T09:39:42Z)
Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction [11.285529781751984]
We propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network is employed to roughly focus on the interactor in the scene at low resolution. The other compact CNN receives the extracted skeleton sequence as input for action recognition.
arXiv Detail & Related papers (2020-07-02T12:41:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.