Transformer-based Action recognition in hand-object interacting scenarios
- URL: http://arxiv.org/abs/2210.11387v1
- Date: Thu, 20 Oct 2022 16:27:37 GMT
- Title: Transformer-based Action recognition in hand-object interacting scenarios
- Authors: Hoseong Cho and Seungryul Baek
- Abstract summary: This report describes the 2nd place solution to the ECCV 2022 Human Body, Hands, and Activities (HBHA) from Egocentric and Multi-view Cameras Challenge: Action Recognition.
We propose a framework that estimates keypoints of two hands and an object with a Transformer-based keypoint estimator and recognizes actions based on the estimated keypoints.
- Score: 6.679721418508601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report describes the 2nd place solution to the ECCV 2022 Human Body,
Hands, and Activities (HBHA) from Egocentric and Multi-view Cameras Challenge:
Action Recognition. This challenge aims to recognize hand-object interaction in
an egocentric view. We propose a framework that estimates keypoints of two
hands and an object with a Transformer-based keypoint estimator and recognizes
actions based on the estimated keypoints. We achieved a top-1 accuracy of
87.19% on the test set.
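The framework described above is a two-stage design: a Transformer-based estimator predicts keypoints for both hands and the manipulated object, and a second network classifies the action from the estimated keypoint sequence. The PyTorch sketch below is only a minimal illustration of that pipeline shape; the keypoint layout (21 joints per hand plus 8 object corners), layer counts, and the temporal-pooling classifier are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Assumed keypoint layout: 21 joints per hand (x2) plus 8 object corners (placeholder choice).
NUM_KPTS = 21 * 2 + 8

class KeypointEstimator(nn.Module):
    """Toy stand-in for a Transformer-based hand/object keypoint estimator."""
    def __init__(self, feat_dim=256, num_kpts=NUM_KPTS):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.kpt_queries = nn.Parameter(torch.randn(num_kpts, feat_dim))
        self.head = nn.Linear(feat_dim, 3)  # (x, y, z) per keypoint
        self.num_kpts = num_kpts

    def forward(self, feats):                          # feats: (B, N, feat_dim) image tokens
        queries = self.kpt_queries.expand(feats.size(0), -1, -1)
        tokens = self.encoder(torch.cat([queries, feats], dim=1))
        return self.head(tokens[:, : self.num_kpts])   # (B, num_kpts, 3)

class KeypointActionClassifier(nn.Module):
    """Classifies an action from a sequence of estimated keypoints."""
    def __init__(self, num_actions=37, hidden=256, num_kpts=NUM_KPTS):  # num_actions is a placeholder
        super().__init__()
        self.embed = nn.Linear(num_kpts * 3, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(hidden, num_actions)

    def forward(self, kpt_seq):                        # kpt_seq: (B, T, num_kpts, 3)
        B, T = kpt_seq.shape[:2]
        x = self.embed(kpt_seq.reshape(B, T, -1))      # per-frame keypoint embedding
        x = self.temporal(x).mean(dim=1)               # pool over time
        return self.cls(x)                             # (B, num_actions) action logits
```

At inference, per-frame keypoints from the estimator would be stacked along the time axis and passed to the classifier.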
Related papers
- Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
A holistic 3D understanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition, and motion generation.
We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits.
Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Team I2R-VI-FF Technical Report on EPIC-KITCHENS VISOR Hand Object Segmentation Challenge 2023 [12.266684016563733]
We present our approach to the EPIC-KITCHENS VISOR Hand Object Challenge.
Our approach combines the baseline method, Point-based Rendering (PointRend), and the Segment Anything Model (SAM).
By effectively combining the strengths of existing methods and applying our refinements, our submission achieved the 1st place in the VISOR HOS Challenge.
arXiv Detail & Related papers (2023-10-31T01:43:14Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize the active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Team VI-I2R Technical Report on EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2022 [6.561596502471905]
We present our submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2022.
This task aims to adapt an action recognition model trained on a labeled source domain to an unlabeled target domain.
Our final submission achieved the first place in terms of top-1 action recognition accuracy.
arXiv Detail & Related papers (2023-01-29T12:29:24Z)
- Transformer-based Global 3D Hand Pose Estimation in Two Hands Manipulating Objects Scenarios [13.59950629234404]
This report describes our 1st place solution to the ECCV 2022 challenge on Human Body, Hands, and Activities (HBHA) from Egocentric and Multi-view Cameras (hand pose estimation).
In this challenge, we aim to estimate global 3D hand poses from an input image in which two hands and an object are interacting, viewed from an egocentric viewpoint.
Our proposed method performs end-to-end multi-hand pose estimation via a transformer architecture.
arXiv Detail & Related papers (2022-10-20T16:24:47Z)
- Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition from Egocentric RGB Videos [50.74218823358754]
We develop a transformer-based framework to exploit temporal information for robust estimation.
We build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation.
Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O.
arXiv Detail & Related papers (2022-09-20T05:52:54Z)
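The hierarchy described in the entry above, a short-term encoder for per-frame hand pose feeding a longer-horizon encoder for the action label, could be sketched roughly as follows. This is only a plausible reading of the abstract; the window size, feature dimensions, and head counts are assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class CascadedTemporalSketch(nn.Module):
    """Illustrative two-level temporal hierarchy: a short-term encoder over
    sliding windows for per-frame hand pose, then a clip-level encoder for
    the action label."""
    def __init__(self, feat_dim=256, num_joints=42, num_actions=37, window=8):
        super().__init__()  # num_joints/num_actions are placeholders (e.g. 21 joints per hand)
        self.window = window
        short = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        clip = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.short_term = nn.TransformerEncoder(short, num_layers=2)
        self.long_term = nn.TransformerEncoder(clip, num_layers=2)
        self.pose_head = nn.Linear(feat_dim, num_joints * 3)   # 3D joints per frame
        self.action_head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):          # frame_feats: (B, T, feat_dim), with T % window == 0
        B, T, D = frame_feats.shape
        # Short-term encoder over non-overlapping windows refines per-frame features.
        w = frame_feats.reshape(B * T // self.window, self.window, D)
        w = self.short_term(w).reshape(B, T, D)
        poses = self.pose_head(w)            # (B, T, num_joints * 3) hand pose per frame
        # Clip-level encoder over the refined features yields the action prediction.
        clip_feat = self.long_term(w).mean(dim=1)
        return poses, self.action_head(clip_feat)  # poses + (B, num_actions) action logits
```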
- Pose for Everything: Towards Category-Agnostic Pose Estimation [93.07415325374761]
Category-Agnostic Pose Estimation (CAPE) aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition.
A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images.
We also introduce Multi-category Pose (MP-100) dataset, which is a 2D pose dataset of 100 object categories containing over 20K instances and is well-designed for developing CAPE algorithms.
arXiv Detail & Related papers (2022-07-21T09:40:54Z)
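The Keypoint Interaction Module (KIM) mentioned above is only described at a high level. As an illustration of the two kinds of interaction it names, keypoint-to-keypoint and support-to-query, a block might look like the sketch below; the attention layout and layer choices are assumptions rather than the paper's actual design.

```python
import torch
import torch.nn as nn

class KeypointInteractionSketch(nn.Module):
    """Illustrative keypoint-interaction block for category-agnostic pose estimation.

    Support keypoint tokens first exchange information via self-attention
    (keypoint-keypoint interaction), then cross-attend to query image tokens
    (support-query relationship) before a head predicts keypoint coordinates.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.coord_head = nn.Linear(dim, 2)          # 2D keypoint locations

    def forward(self, support_kpt_tokens, query_tokens):
        # support_kpt_tokens: (B, K, dim) -- one token per support keypoint
        # query_tokens:       (B, N, dim) -- flattened query image features
        x, _ = self.self_attn(support_kpt_tokens, support_kpt_tokens, support_kpt_tokens)
        x = self.norm1(support_kpt_tokens + x)       # keypoint-keypoint interaction
        y, _ = self.cross_attn(x, query_tokens, query_tokens)
        y = self.norm2(x + y)                        # support-query interaction
        return self.coord_head(y)                    # (B, K, 2) predicted coordinates
```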
- Sequential Decision-Making for Active Object Detection from Hand [43.839322860501596]
A key component of understanding hand-object interactions is the ability to identify the active object.
We set up our active object detection method as a sequential decision-making process conditioned on the location and appearance of the hands.
A key innovation of our approach is the design of the active object detection policy, which uses an internal representation called the Box Field.
arXiv Detail & Related papers (2021-10-21T23:40:45Z)
- One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z)
- One-Shot Affordance Detection [0.0]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
To empower robots with this ability in unseen scenarios, we consider the challenging one-shot affordance detection problem.
We devise a One-Shot Affordance Detection (OS-AD) network that firstly estimates the purpose and then transfers it to help detect the common affordance.
arXiv Detail & Related papers (2021-06-28T14:22:52Z)
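The last two entries describe the same one-shot setting: infer the action purpose from a single support example and use it to find the common affordance in query images. Purely to illustrate that conditioning pattern (not the OSAD-Net or OS-AD architecture), a minimal sketch could condition query features on a purpose embedding, as below; the encoder, embedding size, and FiLM-style modulation are all assumptions.

```python
import torch
import torch.nn as nn

class OneShotAffordanceSketch(nn.Module):
    """Illustrative one-shot affordance detector.

    A support image showing a human-object interaction is encoded into a
    'purpose' embedding, which then modulates query-image features before a
    small decoder predicts a per-pixel affordance map.
    """
    def __init__(self, feat_dim=64, embed_dim=128):
        super().__init__()
        # Shared convolutional encoder (a real system would use a pretrained backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Purpose embedding from globally pooled support features.
        self.purpose = nn.Linear(feat_dim, embed_dim)
        # FiLM-style modulation: per-channel scale and shift for query features.
        self.film = nn.Linear(embed_dim, 2 * feat_dim)
        # Decoder predicting a single-channel affordance map.
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 1, 1),
        )

    def forward(self, support_img, query_imgs):
        # support_img: (1, 3, H, W); query_imgs: (B, 3, H, W)
        s = self.encoder(support_img).mean(dim=(2, 3))          # (1, feat_dim)
        p = self.purpose(s)                                     # (1, embed_dim) purpose code
        gamma, beta = self.film(p).chunk(2, dim=-1)             # (1, feat_dim) each
        q = self.encoder(query_imgs)                            # (B, feat_dim, h, w)
        q = gamma[..., None, None] * q + beta[..., None, None]  # condition on purpose
        return torch.sigmoid(self.decoder(q))                   # (B, 1, h, w) affordance map
```

A real system would also reason jointly over all candidate query images and draw on richer cues from the support interaction, which this sketch omits.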
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.