FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition
in Kitchen Scenes
- URL: http://arxiv.org/abs/2306.10858v1
- Date: Mon, 19 Jun 2023 11:21:59 GMT
- Title: FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition
in Kitchen Scenes
- Authors: Ting Zhe, Yongqian Li, Jing Zhang, Yong Luo, Han Hu, Bo Du, Yonggang
Wen, Dacheng Tao
- Abstract summary: We propose FHA-Kitchens, a novel dataset of fine-grained hand actions in kitchen scenes.
Our dataset consists of 2,377 video clips and 30,047 images collected from 8 different types of dishes.
Based on the constructed dataset, we benchmark representative action recognition and detection models.
- Score: 92.95591601807747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A typical task in the field of video understanding is hand action
recognition, which has a wide range of applications. Existing works either
mainly focus on full-body actions, or the defined action categories are
relatively coarse-grained. In this paper, we propose FHA-Kitchens, a novel
dataset of fine-grained hand actions in kitchen scenes. In particular, we focus
on human hand interaction regions and analyze them in depth to further refine
hand action information and interaction regions. Our FHA-Kitchens dataset
consists of 2,377 video clips and 30,047 images collected from 8 different
types of dishes, and all hand interaction regions in each image are labeled
with high-quality fine-grained action classes and bounding boxes. We represent
the action information in each hand interaction region as a triplet, resulting
in a total of 878 action triplets. Based on the constructed dataset, we
benchmark representative action recognition and detection models on the
following three tracks: (1) supervised learning for hand interaction region and
object detection, (2) supervised learning for fine-grained hand action
recognition, and (3) intra- and inter-class domain generalization for hand
interaction region detection. The experimental results offer compelling
empirical evidence that highlights the challenges inherent in fine-grained hand
action recognition, while also shedding light on potential avenues for future
research, particularly in relation to pre-training strategy, model design, and
domain generalization. The dataset will be released at
https://github.com/tingZ123/FHA-Kitchens.
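The abstract describes each labeled hand interaction region as a bounding box paired with a fine-grained action triplet. A minimal sketch of how such an annotation might be represented is shown below; the field names, the bounding-box convention, and the example triplet components are illustrative assumptions, since the dataset's actual schema is not reproduced on this page.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class HandActionAnnotation:
    """Hypothetical record for one labeled hand interaction region.

    Field names and the triplet decomposition are assumptions for
    illustration, not the dataset's actual annotation format.
    """
    image_id: str                            # source frame identifier
    bbox: Tuple[float, float, float, float]  # assumed (x, y, w, h) in pixels
    triplet: Tuple[str, str, str]            # assumed (subject, verb, object)

    def label(self) -> str:
        """Flatten the triplet into a single fine-grained class string."""
        return "-".join(self.triplet)

# Example usage with made-up values:
ann = HandActionAnnotation(
    image_id="dish03_clip0125_frame0007",
    bbox=(412.0, 230.5, 96.0, 88.0),
    triplet=("right-hand", "cut", "carrot"),
)
print(ann.label())  # -> right-hand-cut-carrot
```

Flattening each triplet into a single class string in this way would yield the 878 distinct action classes the abstract reports as action triplets.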
Related papers
- ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily Living [4.221961702292134]
ADL4D is a dataset of up to two subjects interacting with different sets of objects while performing Activities of Daily Living (ADL).
Our dataset consists of 75 sequences with a total of 1.1M RGB-D frames, hand and object poses, and per-hand fine-grained action annotations.
We develop an automatic system for multi-view multi-hand 3D pose annotation capable of tracking hand poses over time.
arXiv Detail & Related papers (2024-02-27T18:51:52Z)
- CaSAR: Contact-aware Skeletal Action Recognition [47.249908147135855]
We present a new framework called Contact-aware Skeletal Action Recognition (CaSAR)
CaSAR uses novel representations of hand-object interaction that encompass spatial information.
Our framework is able to learn how the hands touch or stay away from the objects for each frame of the action sequence, and use this information to predict the action class.
arXiv Detail & Related papers (2023-09-17T09:42:40Z)
- ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding [8.923830513183882]
We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras.
In the ATTACH dataset, more than 68% of annotations overlap with other annotations, which is many times more than in related datasets.
We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
arXiv Detail & Related papers (2023-04-17T12:31:24Z)
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
Embodied Reference Understanding (ERU) is designed to address this concern.
A new dataset, ScanERU, is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
- Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation [84.28064034301445]
Self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands, is a major cause of the final 3D pose error.
We propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image.
We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset.
arXiv Detail & Related papers (2021-07-01T13:28:02Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.