ORMNet: Object-centric Relationship Modeling for Egocentric Hand-object Segmentation
- URL: http://arxiv.org/abs/2407.05576v1
- Date: Mon, 8 Jul 2024 03:17:10 GMT
- Title: ORMNet: Object-centric Relationship Modeling for Egocentric Hand-object Segmentation
- Authors: Yuejiao Su, Yi Wang, Lap-Pui Chau
- Abstract summary: Egocentric hand-object segmentation (EgoHOS) is a new task that aims to segment hands and the objects they interact with in egocentric images.
This paper proposes a novel end-to-end Object-centric Relationship Modeling Network (ORMNet) for EgoHOS.
- Score: 14.765419467710812
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Egocentric hand-object segmentation (EgoHOS) is a new task that aims to segment hands and the objects they interact with in egocentric images. Although current methods have achieved significant advances, building an end-to-end model with high accuracy remains an unresolved challenge. Moreover, existing methods do not explicitly model the relationships between hands and objects or between objects, so they discard critical information about hand-object interaction and introduce confusion into the algorithm, ultimately reducing segmentation performance. To address these limitations, this paper proposes a novel end-to-end Object-centric Relationship Modeling Network (ORMNet) for EgoHOS. Specifically, on top of a single-encoder, multi-decoder framework, we design the Hand-Object Relation (HOR) module, which uses hand-guided attention to capture the correlation between hands and objects and enhance their representations. Moreover, based on the observed interrelationships between different categories of objects, we introduce the Object Relation Decoupling (ORD) strategy, which decouples the two-hand objects during training and thereby alleviates the ambiguity of the network. Experimental results on three datasets show that the proposed ORMNet achieves notably superior segmentation performance with robust generalization.
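As a rough illustration of the architecture sketched in the abstract, and not the authors' implementation, the PyTorch snippet below shows one way a single-encoder, multi-decoder head with hand-guided cross-attention could be wired: the hand decoder's soft prediction gates the shared encoder tokens, and the object tokens attend to those hand-weighted tokens before object decoding. All module names, class counts, dimensions, and the gating scheme are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class HandGuidedAttention(nn.Module):
    """Hypothetical hand-guided cross-attention: object tokens attend to
    hand-weighted tokens so hand cues can refine the object features.
    Names and dimensions are illustrative assumptions, not the authors' code."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obj_tokens: torch.Tensor, hand_tokens: torch.Tensor) -> torch.Tensor:
        # obj_tokens, hand_tokens: (B, N, C) token sequences from a shared encoder
        attended, _ = self.attn(query=obj_tokens, key=hand_tokens, value=hand_tokens)
        return self.norm(obj_tokens + attended)  # residual fusion


class ToyEgoHOSHead(nn.Module):
    """Single shared encoder feeding separate hand / object decoders, loosely
    mirroring the single-encoder, multi-decoder layout described above."""

    def __init__(self, dim: int = 256, n_hand_cls: int = 3, n_obj_cls: int = 4):
        super().__init__()
        self.hor = HandGuidedAttention(dim)
        self.hand_decoder = nn.Linear(dim, n_hand_cls)  # stand-in per-token classifier
        self.obj_decoder = nn.Linear(dim, n_obj_cls)

    def forward(self, feat: torch.Tensor):
        # feat: (B, N, C) tokens from the shared encoder (N = H*W patches)
        hand_logits = self.hand_decoder(feat)
        # soft hand-foreground probability (class 0 assumed to be background)
        hand_prob = hand_logits.softmax(dim=-1)[..., 1:].sum(dim=-1, keepdim=True)
        obj_tokens = self.hor(feat, feat * hand_prob)  # hand-guided refinement
        obj_logits = self.obj_decoder(obj_tokens)
        return hand_logits, obj_logits


if __name__ == "__main__":
    feat = torch.randn(2, 64 * 64, 256)
    hand_logits, obj_logits = ToyEgoHOSHead()(feat)
    print(hand_logits.shape, obj_logits.shape)  # (2, 4096, 3) and (2, 4096, 4)
```

The point of the sketch is the information flow: hand evidence is computed first and then injected into the object branch, which is the kind of hand-to-object guidance the HOR module is described as providing.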
Related papers
- Appearance-based Refinement for Object-Centric Motion Segmentation [95.80420062679104]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a simple selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTubeVOS, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Interacting Hand-Object Pose Estimation via Dense Mutual Attention [97.26400229871888]
3D hand-object pose estimation is the key to the success of many computer vision applications.
We propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object.
Our method is able to produce physically plausible poses with high quality and real-time inference speed.
arXiv Detail & Related papers (2022-11-16T10:01:33Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Complex-Valued Autoencoders for Object Discovery [62.26260974933819]
We propose a distributed approach to object-centric representations: the Complex AutoEncoder.
We show that this simple and efficient approach achieves better reconstruction performance than an equivalent real-valued autoencoder on simple multi-object datasets.
We also show that it achieves competitive unsupervised object discovery performance to a SlotAttention model on two datasets, and manages to disentangle objects in a third dataset where SlotAttention fails - all while being 7-70 times faster to train.
arXiv Detail & Related papers (2022-04-05T09:25:28Z)
- Joint Hand-object 3D Reconstruction from a Single Image with Cross-branch Feature Fusion [78.98074380040838]
We propose to consider hand and object jointly in feature space and explore the reciprocity of the two branches.
We employ an auxiliary depth estimation module to augment the input RGB image with the estimated depth map.
Our approach significantly outperforms existing approaches in terms of the reconstruction accuracy of objects.
arXiv Detail & Related papers (2020-06-28T09:50:25Z)
- A Deep Learning Approach to Object Affordance Segmentation [31.221897360610114]
We design an autoencoder that infers pixel-wise affordance labels in both videos and static images.
Our model surpasses the need for object labels and bounding boxes by using a soft-attention mechanism.
We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF.
arXiv Detail & Related papers (2020-04-18T15:34:41Z)
- Object-Centric Image Generation from Layouts [93.10217725729468]
We develop a layout-to-image-generation method to generate complex scenes with multiple objects.
Our method learns representations of the spatial relationships between objects in the scene, which lead to our model's improved layout-fidelity.
We introduce SceneFID, an object-centric adaptation of the popular Fréchet Inception Distance metric that is better suited for multi-object images (a rough sketch of the idea appears after this list).
arXiv Detail & Related papers (2020-03-16T21:40:09Z)
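The SceneFID entry above adapts the Fréchet Inception Distance to per-object crops instead of whole images. As a hedged sketch of that general idea, and not the paper's code, the snippet below computes a Fréchet distance over features of per-object crops; the toy feature function is a stand-in for the Inception embedding the real metric uses.

```python
import numpy as np
from scipy import linalg


def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two feature sets.
    feats_*: (N, D) arrays of per-crop features (e.g. Inception pool features)."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))


def scene_fid(real_crops, fake_crops, feature_fn):
    """Object-centric variant: 'crops' are per-object patches cut out with the
    layout boxes; feature_fn maps a crop to a 1-D feature vector."""
    feats_real = np.stack([feature_fn(c) for c in real_crops])
    feats_fake = np.stack([feature_fn(c) for c in fake_crops])
    return frechet_distance(feats_real, feats_fake)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # stand-in "crops" and a toy feature function (mean color per crop)
    real = [rng.random((32, 32, 3)) for _ in range(64)]
    fake = [rng.random((32, 32, 3)) for _ in range(64)]
    toy_features = lambda crop: crop.reshape(-1, 3).mean(0)
    print(scene_fid(real, fake, toy_features))
```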