Multi-Granularity Hand Action Detection
- URL: http://arxiv.org/abs/2306.10858v2
- Date: Fri, 9 Aug 2024 18:19:42 GMT
- Title: Multi-Granularity Hand Action Detection
- Authors: Ting Zhe, Jing Zhang, Yongqian Li, Yong Luo, Han Hu, Dacheng Tao
- Abstract summary: The FHA-Kitchens dataset comprises 2,377 video clips and 30,047 frames, annotated with approximately 200k bounding boxes and 880 action categories.
We propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method.
- Score: 58.88274905101276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting hand actions in videos is crucial for understanding video content and has diverse real-world applications. Existing approaches often focus on whole-body actions or coarse-grained action categories, lacking fine-grained hand-action localization information. To fill this gap, we introduce the FHA-Kitchens (Fine-Grained Hand Actions in Kitchen Scenes) dataset, providing both coarse- and fine-grained hand action categories along with localization annotations. This dataset comprises 2,377 video clips and 30,047 frames, annotated with approximately 200k bounding boxes and 880 action categories. Evaluation of existing action detection methods on FHA-Kitchens reveals varying generalization capabilities across different granularities. To handle multi-granularity in hand actions, we propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method. It incorporates two new designs: Multi-dimensional Action Queries and Coarse-Fine Contrastive Denoising. Extensive experiments demonstrate MG-HAD's effectiveness for multi-granularity hand action detection, highlighting the significance of FHA-Kitchens for future research and real-world applications. The dataset and source code are available at https://github.com/superZ678/MG-HAD.
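The abstract does not spell out the architecture, but a rough PyTorch sketch can make the multi-granularity idea concrete: a shared set of decoder queries feeds a box regressor plus separate coarse- and fine-grained classifiers. Except for the 880 fine-grained categories taken from the abstract, everything below is an illustrative assumption; the class `MultiGranularityHead`, the 256-dimensional queries, and the coarse class count are hypothetical, not the authors' MG-HAD code (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn

class MultiGranularityHead(nn.Module):
    """Hypothetical head: each decoder query predicts a hand-action box plus
    coarse- and fine-grained action logits. Illustrative only, not MG-HAD."""

    def __init__(self, d_model=256, num_coarse=9, num_fine=880):
        super().__init__()
        # Box regressor: normalized (cx, cy, w, h) per query.
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid(),
        )
        self.coarse_cls = nn.Linear(d_model, num_coarse)  # coarse action logits
        self.fine_cls = nn.Linear(d_model, num_fine)      # fine action logits

    def forward(self, queries):  # queries: (batch, num_queries, d_model)
        return {
            "boxes": self.box_head(queries),
            "coarse_logits": self.coarse_cls(queries),
            "fine_logits": self.fine_cls(queries),
        }

head = MultiGranularityHead()
out = head(torch.randn(2, 100, 256))  # 2 frames, 100 queries each
print(out["boxes"].shape, out["fine_logits"].shape)
# torch.Size([2, 100, 4]) torch.Size([2, 100, 880])
```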
Related papers
- ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding [31.481969919049472]
ActionArt is a fine-grained video-caption dataset designed to advance research in human-centric multimodal understanding.
Our dataset comprises thousands of videos capturing a broad spectrum of human actions, human-object interactions, and diverse scenarios.
We develop eight sub-tasks to evaluate the fine-grained understanding capabilities of existing large multimodal models across different dimensions.
arXiv Detail & Related papers (2025-04-25T08:05:32Z)
- Mixture-of-Noises Enhanced Forgery-Aware Predictor for Multi-Face Manipulation Detection and Localization [52.87635234206178]
This paper proposes a new framework, namely MoNFAP, specifically tailored for multi-face manipulation detection and localization.
The framework incorporates two novel modules: the Forgery-aware Unified Predictor (FUP) Module and the Mixture-of-Noises Module (MNM).
arXiv Detail & Related papers (2024-08-05T08:35:59Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality (a generic sketch of such a query block follows this entry).
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
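The IMQ summary above describes learnable queries that pool global context within a modality. The sketch below shows a generic learnable-query cross-attention block of that flavor; the class name, dimensions, and single-layer design are assumptions for illustration, not the paper's IMQ implementation.

```python
import torch
import torch.nn as nn

class ImplicitQueryBlock(nn.Module):
    """Generic learnable-query pooling: a small set of trainable queries
    cross-attends to one modality's tokens to gather global context.
    Illustrative assumption, not the paper's IMQ implementation."""

    def __init__(self, d_model=256, num_queries=8, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tokens):  # tokens: (batch, seq_len, d_model)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # queries attend to all tokens
        return self.norm(pooled)  # (batch, num_queries, d_model)

block = ImplicitQueryBlock()
image_tokens = torch.randn(2, 196, 256)  # e.g., 14x14 patch embeddings
print(block(image_tokens).shape)  # torch.Size([2, 8, 256])
```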
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details (a simplified matching sketch follows this entry).
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
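To make the encode-match-fuse pipeline above concrete, here is a minimal sketch of matching-based few-shot classification across multiple views: cosine similarities are computed per view and fused by averaging. This is a generic baseline formulation under assumed tensor shapes, not M$^3$Net's actual modules.

```python
import torch
import torch.nn.functional as F

def match_and_fuse(query_views, support_views):
    """Match a query to class prototypes independently per view with cosine
    similarity, then fuse the per-view scores by averaging. A generic
    matching-based baseline with assumed shapes, not M$^3$Net's modules.

    query_views:   (num_views, dim)
    support_views: (num_classes, num_views, dim) class prototypes per view
    """
    q = F.normalize(query_views, dim=-1)
    s = F.normalize(support_views, dim=-1)
    per_view = torch.einsum("vd,cvd->cv", q, s)  # (num_classes, num_views)
    return per_view.mean(dim=1)  # fused score per class

scores = match_and_fuse(torch.randn(3, 128), torch.randn(5, 3, 128))
print(scores.shape, scores.argmax().item())  # torch.Size([5]) and a class index
```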
- MSMG-Net: Multi-scale Multi-grained Supervised Metworks for Multi-task Image Manipulation Detection and Localization [1.14219428942199]
A novel multi-scale multi-grained deep network (MSMG-Net) is proposed to automatically identify manipulated regions.
In our MSMG-Net, a parallel multi-scale feature extraction structure is used to extract multi-scale features.
The MSMG-Net can effectively perceive object-level semantics and encode edge artifacts (a toy multi-scale branch sketch follows this entry).
arXiv Detail & Related papers (2022-11-06T14:58:21Z)
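The parallel multi-scale extraction mentioned above can be illustrated with a toy module that runs convolution branches at different dilation rates side by side and concatenates their outputs. The branch design and channel counts are assumptions; MSMG-Net's real structure is described in the paper.

```python
import torch
import torch.nn as nn

class ParallelMultiScale(nn.Module):
    """Toy parallel multi-scale extractor: 3x3 conv branches with different
    dilation rates run side by side and their outputs are concatenated.
    Illustrative only; MSMG-Net's actual structure is in the paper."""

    def __init__(self, in_ch=3, out_ch=32, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, x):  # x: (batch, in_ch, H, W)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

net = ParallelMultiScale()
print(net(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 96, 64, 64])
```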
- Fine-grained Hand Gesture Recognition in Multi-viewpoint Hand Hygiene [3.588453140011797]
This paper contributes a new high-quality dataset for hand gesture recognition in hand hygiene systems, named "MFH".
The MFH dataset contains a total of 731,147 samples captured from different camera views at 6 non-overlapping locations.
arXiv Detail & Related papers (2021-09-07T08:14:15Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions (a baseline episode sketch follows this entry).
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
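To make the few-shot problem statement above concrete, the sketch below runs a standard N-way K-shot episode with the common prototype baseline: class prototypes are averaged from the support set and each query is assigned to the nearest one. This shows only the generic episode structure, not the paper's bidirectional-attention or contrastive meta-learning method.

```python
import torch

def episode_predict(support, support_labels, query, n_way):
    """Standard N-way K-shot baseline: average each class's support
    embeddings into a prototype and assign each query to the nearest
    prototype. Generic episode structure only, not the paper's method.

    support: (n_way * k_shot, dim), query: (num_query, dim)
    """
    protos = torch.stack([
        support[support_labels == c].mean(dim=0) for c in range(n_way)
    ])  # (n_way, dim)
    dists = torch.cdist(query, protos)  # Euclidean distance to each prototype
    return dists.argmin(dim=1)  # predicted class per query

labels = torch.arange(5).repeat_interleave(3)  # a 5-way 3-shot support set
preds = episode_predict(torch.randn(15, 64), labels, torch.randn(4, 64), 5)
print(preds.shape)  # torch.Size([4])
```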
- Human Action Recognition Based on Multi-scale Feature Maps from Depth Video Sequences [12.30399970340689]
We present a novel framework focusing on multi-scale motion information to recognize human actions from depth video sequences.
We employ depth motion images (DMI) as templates to generate multi-scale static representations of actions.
We extract a multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features (a sketch of the basic DMI computation follows this entry).
arXiv Detail & Related papers (2021-01-19T13:46:42Z)
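Depth motion images are commonly built by accumulating absolute differences between consecutive depth frames into a single static map. The sketch below shows that common formulation; the paper's LP-DMI variant, multi-scale pyramid, and HOG extraction add further steps not reproduced here.

```python
import numpy as np

def depth_motion_image(depth_frames):
    """Accumulate absolute differences between consecutive depth frames into
    one static motion map, then min-max normalize. This is the common DMI
    formulation; the paper's LP-DMI and pyramid steps are not reproduced.

    depth_frames: (num_frames, H, W) array of depth maps
    """
    diffs = np.abs(np.diff(depth_frames.astype(np.float32), axis=0))
    dmi = diffs.sum(axis=0)  # per-pixel accumulated motion energy
    rng = dmi.max() - dmi.min()
    return (dmi - dmi.min()) / rng if rng > 0 else dmi

clip = np.random.rand(30, 240, 320)  # a 30-frame depth clip
print(depth_motion_image(clip).shape)  # (240, 320)
```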
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition (a sketch of such hierarchical annotations follows this entry).
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
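FineGym's annotations attach temporal labels at both the action and sub-action level within a semantic hierarchy. A minimal data structure of that shape is sketched below; the labels, times, and field names are made-up illustrations, not FineGym's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SubAction:
    label: str    # sub-action (element-level) category
    start: float  # seconds
    end: float

@dataclass
class ActionInstance:
    label: str    # action-level category
    start: float
    end: float
    sub_actions: list = field(default_factory=list)

# Made-up annotation in the spirit of FineGym's action/sub-action levels;
# labels, times, and field names are illustrative, not the real schema.
vault = ActionInstance("vault", 12.0, 18.5, [
    SubAction("round-off", 12.0, 13.2),
    SubAction("salto forward stretched", 13.2, 15.0),
])
print(len(vault.sub_actions))  # 2
```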
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.