FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition
in Kitchen Scenes
- URL: http://arxiv.org/abs/2306.10858v1
- Date: Mon, 19 Jun 2023 11:21:59 GMT
- Title: FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition
in Kitchen Scenes
- Authors: Ting Zhe, Yongqian Li, Jing Zhang, Yong Luo, Han Hu, Bo Du, Yonggang
Wen, Dacheng Tao
- Abstract summary: We propose FHA-Kitchens, a novel dataset of fine-grained hand actions in kitchen scenes.
Our dataset consists of 2,377 video clips and 30,047 images collected from 8 different types of dishes.
Based on the constructed dataset, we benchmark representative action recognition and detection models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A typical task in the field of video understanding is hand action
recognition, which has a wide range of applications. Existing works either
focus mainly on full-body actions or define relatively coarse-grained action
categories. In this paper, we propose FHA-Kitchens, a novel dataset of
fine-grained hand actions in kitchen scenes. In particular, we focus on human
hand interaction regions and conduct an in-depth analysis to further refine
hand action information and interaction regions. Our FHA-Kitchens dataset
consists of 2,377 video clips and 30,047 images collected from 8 different
types of dishes, and all hand interaction regions in each image are labeled
with high-quality fine-grained action classes and bounding boxes. We represent
the action information in each hand interaction region as a triplet, resulting
in a total of 878 action triplets. Based on the constructed dataset, we
benchmark representative action recognition and detection models on the
following three tracks: (1) supervised learning for hand interaction region and
object detection, (2) supervised learning for fine-grained hand action
recognition, and (3) intra- and inter-class domain generalization for hand
interaction region detection. The experimental results offer compelling
empirical evidence that highlights the challenges inherent in fine-grained hand
action recognition, while also shedding light on potential avenues for future
research, particularly in relation to pre-training strategy, model design, and
domain generalization. The dataset will be released at
https://github.com/tingZ123/FHA-Kitchens.
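Per the abstract, each hand interaction region is labeled with a fine-grained action triplet and a bounding box, yielding 878 distinct triplets overall. The released annotation schema is not specified here; the sketch below is a minimal illustration only, assuming a (hand role, verb, object)-style triplet and an [x, y, w, h] pixel box — both are assumptions, not the dataset's documented format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandActionAnnotation:
    """One labeled hand interaction region (hypothetical schema,
    not the released FHA-Kitchens format)."""
    triplet: tuple[str, str, str]            # e.g. (hand role, action verb, target object)
    bbox: tuple[float, float, float, float]  # assumed [x, y, w, h] in pixels

# Hypothetical annotations for one frame:
frame_annotations = [
    HandActionAnnotation(("left-hand", "hold", "knife"), (120.0, 80.0, 64.0, 48.0)),
    HandActionAnnotation(("right-hand", "cut", "carrot"), (200.0, 90.0, 70.0, 52.0)),
]

# Counting distinct triplets across a split is how a statistic like the
# paper's "878 action triplets" would be computed:
unique_triplets = {a.triplet for a in frame_annotations}
print(len(unique_triplets))  # 2 distinct triplets in this toy frame
```

Keeping the triplet as a hashable tuple makes deduplication and per-class counting a one-line set or Counter operation.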
Related papers
- ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding [31.481969919049472]
ActionArt is a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding.
Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios.
We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions.
arXiv Detail & Related papers (2025-04-25T08:05:32Z)
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM).
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- MSMG-Net: Multi-scale Multi-grained Supervised Networks for Multi-task Image Manipulation Detection and Localization [1.14219428942199]
A novel multi-scale multi-grained deep network (MSMG-Net) is proposed to automatically identify manipulated regions.
In our MSMG-Net, a parallel multi-scale feature extraction structure is used to extract multi-scale features.
The MSMG-Net can effectively perceive the object-level semantics and encode the edge artifact.
arXiv Detail & Related papers (2022-11-06T14:58:21Z)
- Fine-grained Hand Gesture Recognition in Multi-viewpoint Hand Hygiene [3.588453140011797]
This paper contributes a new high-quality dataset for hand gesture recognition in hand hygiene systems, named "MFH".
The MFH dataset contains a total of 731,147 samples captured from different camera views at 6 non-overlapping locations.
arXiv Detail & Related papers (2021-09-07T08:14:15Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Human Action Recognition Based on Multi-scale Feature Maps from Depth Video Sequences [12.30399970340689]
We present a novel framework focusing on multi-scale motion information to recognize human actions from depth video sequences.
We employ depth motion images (DMI) as the templates to generate the multi-scale static representation of actions.
We extract the multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features.
arXiv Detail & Related papers (2021-01-19T13:46:42Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)