Egocentric Hand-object Interaction Detection and Application
- URL: http://arxiv.org/abs/2109.14734v1
- Date: Wed, 29 Sep 2021 21:47:16 GMT
- Title: Egocentric Hand-object Interaction Detection and Application
- Authors: Yao Lu, Walterio W. Mayol-Cuevas
- Abstract summary: We present a method to detect the hand-object interaction from an egocentric perspective.
We train networks predicting hand pose, hand mask and in-hand object mask to jointly predict the hand-object interaction status.
Our method can run at over $\textbf{30}$ FPS, which is much more efficient than Shan's ($\textbf{1}\sim\textbf{2}$ FPS).
- Score: 24.68535915849555
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we present a method to detect the hand-object interaction from
an egocentric perspective. In contrast to the massive data-driven
discriminator-based method of \cite{Shan20}, we propose a novel workflow that
utilises hand and object cues. Specifically, we train networks predicting hand pose,
hand mask and in-hand object mask to jointly predict the hand-object
interaction status. We compare our method with the most recent work from Shan
et al. \cite{Shan20} on selected images from EPIC-KITCHENS
\cite{damen2018scaling} dataset and achieve $89\%$ accuracy on HOI (hand-object
interaction) detection, which is comparable to Shan's ($92\%$). However, for
real-time performance on the same machine, our method can run at over
$\textbf{30}$ FPS, which is much more efficient than Shan's
($\textbf{1}\sim\textbf{2}$ FPS). Furthermore, with our approach, we are able
to segment scriptless activities by extracting the frames flagged by the HOI
status detection. We achieve $\textbf{68.2\%}$ and $\textbf{82.8\%}$ F1 scores
on the GTEA \cite{fathi2011learning} and UTGrasp \cite{cai2015scalable} datasets,
respectively, both comparable to the SOTA methods.
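As a rough illustration of the workflow described above, the sketch below fuses the three cue predictions (hand pose, hand mask, in-hand object mask) into a binary HOI status. The backbone, feature sizes, and the concatenation-based fusion head are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of the cue-fusion idea: three task branches (hand pose,
# hand mask, in-hand object mask) share an image encoder, and a small
# classification head turns their outputs into a binary HOI status.
# All module sizes and the fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn

class HOIStatusNet(nn.Module):
    def __init__(self, num_keypoints: int = 21, feat_dim: int = 256):
        super().__init__()
        # Shared encoder (stand-in for whatever backbone is actually used).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((32, 32)),
        )
        # Cue branches: 2D hand keypoints, hand mask, in-hand object mask.
        self.pose_head = nn.Linear(feat_dim * 32 * 32, num_keypoints * 2)
        self.hand_mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.obj_mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        # Fusion head: concatenated cue summaries -> interaction / no interaction.
        self.status_head = nn.Sequential(
            nn.Linear(num_keypoints * 2 + 2, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 2),
        )

    def forward(self, image: torch.Tensor):
        feat = self.backbone(image)                  # (B, C, 32, 32)
        pose = self.pose_head(feat.flatten(1))       # (B, K*2) keypoint coordinates
        hand_mask = torch.sigmoid(self.hand_mask_head(feat))
        obj_mask = torch.sigmoid(self.obj_mask_head(feat))
        # Summarise each mask by its mean activation (a coarse "how much hand /
        # how much in-hand object is visible" signal) before fusion.
        cues = torch.cat(
            [pose, hand_mask.mean(dim=(2, 3)), obj_mask.mean(dim=(2, 3))], dim=1
        )
        status_logits = self.status_head(cues)       # (B, 2)
        return pose, hand_mask, obj_mask, status_logits


if __name__ == "__main__":
    model = HOIStatusNet()
    frames = torch.randn(4, 3, 256, 256)   # a small batch of egocentric frames
    _, _, _, logits = model(frames)
    print(logits.argmax(dim=1))             # 1 = hand-object interaction, 0 = none
```

Per-frame status predictions of this kind can then be grouped over a video, with runs of consecutive "interacting" frames forming the activity segments evaluated on GTEA and UTGrasp.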
Related papers
- You Only Estimate Once: Unified, One-stage, Real-Time Category-level Articulated Object 6D Pose Estimation for Robotic Grasping [119.41166438439313]
YOEO is a single-stage method that outputs instance segmentation and NPCS representations in an end-to-end manner.
We use a unified network to generate point-wise semantic labels and centroid offsets, allowing points from the same part instance to vote for the same centroid.
We also deploy our synthetically-trained model in a real-world setting, providing real-time visual feedback at 200Hz.
arXiv Detail & Related papers (2025-06-06T03:49:20Z) - Robot Instance Segmentation with Few Annotations for Grasping [10.005879464111915]
We propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI)
Our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images.
We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-07-01T13:58:32Z) - Vision Transformer with Sparse Scan Prior [57.37893387775829]
Inspired by the human eye's sparse scanning mechanism, we propose a \textbf{S}parse \textbf{S}can \textbf{S}elf-\textbf{A}ttention ($\mathrm{S}^3\mathrm{A}$) mechanism.
This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors.
Building on $\mathrm{S}^3\mathrm{A}$, we introduce the \textbf{S}parse \textbf{S}can \textbf{Vi}sion \textbf{T}ransformer (SSViT).
arXiv Detail & Related papers (2024-05-22T04:34:36Z) - FreeA: Human-object Interaction Detection using Free Annotation Labels [9.47028064037262]
FreeA is a self-adaptive, language-driven HOI detection method.
It generates latent HOI labels without requiring manual annotation.
It achieves state-of-the-art performance among weakly supervised HOI competitors.
arXiv Detail & Related papers (2024-03-04T08:38:15Z) - Exploring the Limits of Deep Image Clustering using Pretrained Models [1.1060425537315088]
We present a methodology that learns to classify images without labels by leveraging pretrained feature extractors.
We propose a novel objective that learns associations between image features by introducing a variant of pointwise mutual information together with instance weighting.
arXiv Detail & Related papers (2023-03-31T08:56:29Z) - Egocentric Hand-object Interaction Detection [13.639883596251313]
We use a multi-cam system to capture hand pose data from multiple perspectives.
Our method can run at over $\textbf{30}$ FPS, which is much more efficient than Shan's.
arXiv Detail & Related papers (2022-11-16T17:31:40Z) - Interacting Hand-Object Pose Estimation via Dense Mutual Attention [97.26400229871888]
3D hand-object pose estimation is the key to the success of many computer vision applications.
We propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object.
Our method is able to produce physically plausible poses with high quality and real-time inference speed.
arXiv Detail & Related papers (2022-11-16T10:01:33Z) - Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios.
arXiv Detail & Related papers (2022-04-12T15:03:51Z) - End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge
Distillation [86.41437210485932]
We aim at advancing zero-shot HOI detection to detect both seen and unseen HOIs simultaneously.
We propose a novel end-to-end zero-shot HOI Detection framework via vision-language knowledge distillation.
Our method outperforms the previous SOTA by 8.92% on unseen mAP and 10.18% on overall mAP.
arXiv Detail & Related papers (2022-04-01T07:27:19Z) - Complex Scene Image Editing by Scene Graph Comprehension [17.72638225034884]
We propose a two-stage method for achieving complex scene image editing by Scene Graph (SGC-Net)
In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs and predict the locations of the target objects.
The second stage uses a conditional diffusion model to edit the image based on our RoI predictions.
arXiv Detail & Related papers (2022-03-24T05:12:54Z) - Understanding Egocentric Hand-Object Interactions from Hand Pose
Estimation [24.68535915849555]
We propose a method to label a dataset which contains the egocentric images pair-wisely.
We also use the collected pairwise data to train our encoder-decoder style network, which has been proven efficient.
arXiv Detail & Related papers (2021-09-29T18:34:06Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z) - SCAN: Learning to Classify Images without Labels [73.69513783788622]
We advocate a two-step approach where feature learning and clustering are decoupled.
A self-supervised task from representation learning is employed to obtain semantically meaningful features.
We obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime.
arXiv Detail & Related papers (2020-05-25T18:12:33Z) - SaccadeNet: A Fast and Accurate Object Detector [76.36741299193568]
We propose a fast and accurate object detector called \textit{SaccadeNet}.
It contains four main modules, the Center Attentive Module, the Corner Attentive Module, the Attention Transitive Module, and the Aggregation Attentive Module, which allow it to attend to different informative object keypoints.
Among all the real-time object detectors, our SaccadeNet achieves the best detection performance.
arXiv Detail & Related papers (2020-03-26T19:47:17Z)