FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition
in Kitchen Scenes
- URL: http://arxiv.org/abs/2306.10858v1
- Date: Mon, 19 Jun 2023 11:21:59 GMT
- Title: FHA-Kitchens: A Novel Dataset for Fine-Grained Hand Action Recognition
in Kitchen Scenes
- Authors: Ting Zhe, Yongqian Li, Jing Zhang, Yong Luo, Han Hu, Bo Du, Yonggang
Wen, Dacheng Tao
- Abstract summary: We propose FHA-Kitchens, a novel dataset of fine-grained hand actions in kitchen scenes.
Our dataset consists of 2,377 video clips and 30,047 images collected from 8 different types of dishes.
Based on the constructed dataset, we benchmark representative action recognition and detection models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A typical task in the field of video understanding is hand action
recognition, which has a wide range of applications. Existing works either
focus mainly on full-body actions or define relatively coarse-grained action
categories. In this paper, we propose FHA-Kitchens, a novel dataset of
fine-grained hand actions in kitchen scenes. In particular, we focus on human
hand interaction regions and conduct an in-depth analysis to further refine
hand action information and interaction regions. Our FHA-Kitchens dataset
consists of 2,377 video clips and 30,047 images collected from 8 different
types of dishes, and all hand interaction regions in each image are labeled
with high-quality fine-grained action classes and bounding boxes. We represent
the action information in each hand interaction region as a triplet, resulting
in a total of 878 action triplets. Based on the constructed dataset, we
benchmark representative action recognition and detection models on the
following three tracks: (1) supervised learning for hand interaction region and
object detection, (2) supervised learning for fine-grained hand action
recognition, and (3) intra- and inter-class domain generalization for hand
interaction region detection. The experimental results offer compelling
empirical evidence that highlights the challenges inherent in fine-grained hand
action recognition, while also shedding light on potential avenues for future
research, particularly in relation to pre-training strategy, model design, and
domain generalization. The dataset will be released at
https://github.com/tingZ123/FHA-Kitchens.
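Per the abstract, each hand interaction region is labeled with a fine-grained action triplet and a bounding box, yielding 878 distinct triplets overall. The released annotation schema is not specified here; the sketch below is a minimal illustration only, assuming a (hand role, verb, object)-style triplet and an [x, y, w, h] pixel box — both are assumptions, not the dataset's documented format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HandActionAnnotation:
    """One labeled hand interaction region (hypothetical schema,
    not the released FHA-Kitchens format)."""
    triplet: tuple[str, str, str]            # e.g. (hand role, action verb, target object)
    bbox: tuple[float, float, float, float]  # assumed [x, y, w, h] in pixels

# Hypothetical annotations for one frame:
frame_annotations = [
    HandActionAnnotation(("left-hand", "hold", "knife"), (120.0, 80.0, 64.0, 48.0)),
    HandActionAnnotation(("right-hand", "cut", "carrot"), (200.0, 90.0, 70.0, 52.0)),
]

# Counting distinct triplets across a split is how a statistic like the
# paper's "878 action triplets" would be computed:
unique_triplets = {a.triplet for a in frame_annotations}
print(len(unique_triplets))  # 2 distinct triplets in this toy frame
```

Keeping the triplet as a hashable tuple makes deduplication and per-class counting a one-line set or Counter operation.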
Related papers
- ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding [31.481969919049472]
ActionArt is a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding.
Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios.
We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions.
arXiv Detail & Related papers (2025-04-25T08:05:32Z)
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM).
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- MSMG-Net: Multi-scale Multi-grained Supervised Networks for Multi-task Image Manipulation Detection and Localization [1.14219428942199]
A novel multi-scale multi-grained deep network (MSMG-Net) is proposed to automatically identify manipulated regions.
In our MSMG-Net, a parallel multi-scale feature extraction structure is used to extract multi-scale features.
The MSMG-Net can effectively perceive the object-level semantics and encode the edge artifact.
arXiv Detail & Related papers (2022-11-06T14:58:21Z)
- Fine-grained Hand Gesture Recognition in Multi-viewpoint Hand Hygiene [3.588453140011797]
This paper contributes a new high-quality dataset for hand gesture recognition in hand hygiene systems, named "MFH".
The MFH dataset contains a total of 731,147 samples captured from different camera views at 6 non-overlapping locations.
arXiv Detail & Related papers (2021-09-07T08:14:15Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Human Action Recognition Based on Multi-scale Feature Maps from Depth Video Sequences [12.30399970340689]
We present a novel framework focusing on multi-scale motion information to recognize human actions from depth video sequences.
We employ depth motion images (DMI) as the templates to generate the multi-scale static representation of actions.
We extract the multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features.
arXiv Detail & Related papers (2021-01-19T13:46:42Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)