Egocentric Human-Object Interaction Detection: A New Benchmark and Method
- URL: http://arxiv.org/abs/2506.14189v2
- Date: Tue, 26 Aug 2025 15:13:05 GMT
- Title: Egocentric Human-Object Interaction Detection: A New Benchmark and Method
- Authors: Kunyuan Deng, Yi Wang, Lap-Pui Chau
- Abstract summary: Egocentric human-object interaction (Ego-HOI) detection is crucial for intelligent agents to understand and assist human activities from a first-person perspective. We introduce the real-world Ego-HOI detection task and Ego-HOIBench, a new dataset with over 27K egocentric images and explicit, fine-grained hand-verb-object triplet annotations. We propose Hand Geometry and Interactivity Refinement (HGIR), a lightweight, plug-and-play scheme that leverages hand pose and geometric cues to enhance interaction representations.
- Score: 15.271558280695631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric human-object interaction (Ego-HOI) detection is crucial for intelligent agents to understand and assist human activities from a first-person perspective. However, progress has been hindered by the lack of benchmarks and methods tailored to egocentric challenges such as severe hand-object occlusion. In this paper, we introduce the real-world Ego-HOI detection task and the accompanying Ego-HOIBench, a new dataset with over 27K egocentric images and explicit, fine-grained hand-verb-object triplet annotations across 123 categories. Ego-HOIBench covers diverse daily scenarios, object types, and both single- and two-hand interactions, offering a comprehensive testbed for Ego-HOI research. Benchmarking existing third-person HOI detectors on Ego-HOIBench reveals significant performance gaps, highlighting the need for egocentric-specific solutions. To this end, we propose Hand Geometry and Interactivity Refinement (HGIR), a lightweight, plug-and-play scheme that leverages hand pose and geometric cues to enhance interaction representations. Specifically, HGIR explicitly extracts global hand geometric features from the estimated hand pose proposals, and further refines interaction features through pose-interaction attention, enabling the model to focus on subtle hand-object relationship differences even under severe occlusion. HGIR significantly improves Ego-HOI detection performance across multiple baselines, achieving new state-of-the-art results on Ego-HOIBench. Our dataset and method establish a solid foundation for future research in egocentric vision and human-object interaction understanding. Project page: https://dengkunyuan.github.io/EgoHOIBench/
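The abstract gives no implementation details, but the pose-interaction attention it describes can be pictured roughly as follows. This is a minimal PyTorch sketch under assumed names and shapes (a `PoseInteractionAttention` module, 21-keypoint 2D hand pose proposals, transformer-style interaction queries); it is illustrative only, not the authors' code.

```python
# Hypothetical sketch in the spirit of HGIR's pose-interaction attention;
# names, shapes, and the keypoint format are assumptions, not the paper's code.
import torch
import torch.nn as nn

class PoseInteractionAttention(nn.Module):
    """Cross-attention from interaction queries to hand-geometry tokens."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Project flattened 2D hand keypoints (assumed 21 per hand) to a
        # global geometric feature per hand.
        self.geom_proj = nn.Sequential(
            nn.Linear(21 * 2, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, interaction_feats, hand_keypoints):
        # interaction_feats: (B, Q, D) interaction queries from a detector head
        # hand_keypoints:    (B, H, 21, 2) estimated hand pose proposals
        geom = self.geom_proj(hand_keypoints.flatten(2))      # (B, H, D)
        refined, _ = self.cross_attn(interaction_feats, geom, geom)
        return self.norm(interaction_feats + refined)         # residual update

# Toy usage: 2 images, 100 interaction queries, 2 hands each.
queries = torch.randn(2, 100, 256)
keypoints = torch.randn(2, 2, 21, 2)
out = PoseInteractionAttention()(queries, keypoints)          # (2, 100, 256)
```

The residual cross-attention update mirrors the plug-and-play claim: interaction queries are only refined by, never replaced with, hand-geometry tokens, so such a block could sit inside an existing detector head.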
Related papers
- Robust Egocentric Referring Video Object Segmentation via Dual-Modal Causal Intervention [58.05340906967343]
Egocentric Referring Video Object Segmentation (Ego-RVOS) aims to segment the specific object actively involved in a human action, as described by a language query, within first-person videos. Existing methods often struggle, learning spurious correlations from skewed object-action pairings in datasets. We introduce Causal-REferring (CERES), a plug-in causal framework that adapts strong, pre-trained RVOS models to the egocentric domain.
arXiv Detail & Related papers (2025-12-30T16:22:14Z) - ECHO: Ego-Centric modeling of Human-Object interactions [71.17118015822699]
ECHO (Ego-Centric modeling of Human-Object interactions) recovers three modalities from minimal observation: human pose, object motion, and contact. It outperforms existing methods that do not offer the same flexibility.
arXiv Detail & Related papers (2025-08-29T12:12:22Z) - Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands. Under this setting, we present InterVLA, the first large-scale human-object-human interaction dataset, comprising 11.4 hours and 1.2M frames of multimodal data. We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction, with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z) - EgoWorld: Translating Exocentric View to Egocentric View using Rich Exocentric Observations [4.252119151012245]
We introduce EgoWorld, a novel framework that reconstructs an egocentric view from rich exocentric observations. Our approach reconstructs a point cloud from estimated exocentric depth maps, reprojects it into the egocentric perspective, and then applies diffusion-based inpainting to produce dense, semantically coherent egocentric images. EgoWorld achieves state-of-the-art performance and demonstrates robust generalization to new objects, actions, scenes, and subjects.
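The geometric half of this pipeline (depth unprojection, frame transform, egocentric reprojection) follows standard pinhole-camera math and can be sketched as below. All function names, intrinsics, and extrinsics are toy assumptions, and the diffusion-based inpainting stage is omitted entirely.

```python
# Minimal sketch of exo-to-ego geometry in the spirit of EgoWorld
# (illustrative only; not the authors' implementation).
import numpy as np

def unproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Depth map (H, W) + intrinsics K (3, 3) -> points (H*W, 3) in camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # back-project to rays
    return rays * depth.reshape(-1, 1)                               # scale rays by depth

def reproject(points, K, T_exo_to_ego, hw):
    """Rigidly transform points into the ego frame and rasterize a sparse depth map."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_ego = (pts_h @ T_exo_to_ego.T)[:, :3]
    valid = pts_ego[:, 2] > 1e-6                                     # keep points in front
    proj = pts_ego[valid] @ K.T
    uv = (proj[:, :2] / proj[:, 2:3]).astype(int)
    h, w = hw
    ego_depth = np.full((h, w), np.inf)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    for (x, y), z in zip(uv[inside], pts_ego[valid][inside, 2]):
        ego_depth[y, x] = min(ego_depth[y, x], z)                    # z-buffer: nearest wins
    return ego_depth  # sparse; the holes are what learned inpainting would fill

# Toy usage: flat depth plane, identity intrinsics and ego pose.
pts = unproject(np.full((4, 4), 2.0), np.eye(3))
ego = reproject(pts, np.eye(3), np.eye(4), hw=(4, 4))
```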
arXiv Detail & Related papers (2025-06-22T04:21:48Z) - ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction [16.338872733140832]
This paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image and a query as input, Ego-IRG is the first task that aims to resolve interactions through three crucial steps: analyzing, answering, and pixel grounding. The accompanying Ego-IRGBench dataset includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions.
arXiv Detail & Related papers (2025-04-02T08:24:35Z) - Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaboratively guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models on both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z) - Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z) - CaRe-Ego: Contact-aware Relationship Modeling for Egocentric Interactive Hand-object Segmentation [14.765419467710812]
Egocentric interactive hand-object segmentation (EgoIHOS) is crucial for understanding human behavior in assistive systems. Previous methods recognize hands and interacting objects as distinct semantic categories based solely on visual features. We propose CaRe-Ego, which emphasizes the contact between hands and objects from two aspects.
arXiv Detail & Related papers (2024-07-08T03:17:10Z) - EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views [51.53089073920215]
Understanding egocentric human-object interaction (HOI) is a fundamental aspect of human-centric perception.
Existing methods primarily leverage observations of HOI to capture interaction regions from an exocentric view.
We present EgoChoir, which links object structures with interaction contexts inherent in appearance and head motion to reveal object affordance.
arXiv Detail & Related papers (2024-05-22T14:03:48Z) - Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition, and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z) - Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
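As a rough illustration of the contrast drawn above, a two-stage "disentangled" head composes separately encoded human and object features only at the verb classifier, which is what keeps each stream's contribution inspectable. The sketch below uses hypothetical names and dimensions and is not taken from the paper.

```python
# Hypothetical two-stage-style disentangled interaction head (illustrative only).
import torch
import torch.nn as nn

class DisentangledInteractionHead(nn.Module):
    def __init__(self, dim: int = 256, num_verbs: int = 117):  # e.g., HICO-DET verbs
        super().__init__()
        self.human_mlp = nn.Linear(dim, dim)    # human-appearance stream
        self.object_mlp = nn.Linear(dim, dim)   # object-appearance stream
        self.verb_cls = nn.Linear(2 * dim, num_verbs)

    def forward(self, human_feat, object_feat):
        # human_feat / object_feat: (B, D) pooled detection features.
        h = torch.relu(self.human_mlp(human_feat))
        o = torch.relu(self.object_mlp(object_feat))
        # Composing the streams only at the classifier keeps each one's
        # contribution separable, unlike a monolithic decoder query.
        return self.verb_cls(torch.cat([h, o], dim=-1))
```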
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding [99.904140768186]
This paper proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA).
We contribute comprehensive pre-train sets, balanced test sets, and a new baseline, together with a training-finetuning strategy.
We believe our data and the findings will pave a new way for Ego-HOI understanding.
arXiv Detail & Related papers (2023-09-05T17:51:16Z) - Geometric Features Informed Multi-person Human-object Interaction Recognition in Videos [19.64072251418535]
We argue for combining the benefits of both visual and geometric features in HOI recognition.
We propose a novel Two-level Geometric feature-informed Graph Convolutional Network (2G-GCN); a rough sketch follows this entry.
To demonstrate the novelty and effectiveness of our method in challenging scenarios, we propose a new multi-person HOI dataset (MPHOI-72).
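In the spirit of the described network, one graph-convolution step that fuses per-node visual and geometric (keypoint) features before propagating over a human-object graph might look as follows; the layer, names, and dimensions are hypothetical, not the paper's 2G-GCN.

```python
# Hypothetical geometry-informed graph convolution (illustrative only).
import torch
import torch.nn as nn

class GeometryInformedGCNLayer(nn.Module):
    """One graph-convolution step over fused visual + geometric node features."""
    def __init__(self, vis_dim: int = 512, geo_dim: int = 64, out_dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + geo_dim, out_dim)  # per-node fusion
        self.propagate = nn.Linear(out_dim, out_dim)       # shared GCN weight

    def forward(self, vis, geo, adj):
        # vis: (N, vis_dim) visual features for N human/object nodes
        # geo: (N, geo_dim) encoded keypoint/position geometry per node
        # adj: (N, N) normalized adjacency over the interaction graph
        x = torch.relu(self.fuse(torch.cat([vis, geo], dim=-1)))
        return torch.relu(self.propagate(adj @ x))         # neighborhood mixing

# Toy usage: 5 entities (e.g., 2 humans + 3 objects) in one frame.
layer = GeometryInformedGCNLayer()
nodes = layer(torch.randn(5, 512), torch.randn(5, 64), torch.eye(5))
```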
arXiv Detail & Related papers (2022-07-19T17:36:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.