FingerCap: Fine-grained Finger-level Hand Motion Captioning
- URL: http://arxiv.org/abs/2511.16951v1
- Date: Fri, 21 Nov 2025 04:59:01 GMT
- Title: FingerCap: Fine-grained Finger-level Hand Motion Captioning
- Authors: Xin Shen, Rui Zhu, Lei Shen, Xinyu Wang, Kaihao Zhang, Tianqing Zhu, Shuchen Wu, Chenxi Miao, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang, Xin Yu
- Abstract summary: Fine-grained Finger-level Hand Motion Captioning aims to generate detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning.
- Score: 44.18347733095312
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding fine-grained human hand motion is fundamental to visual perception, embodied intelligence, and multimodal communication. In this work, we propose Fine-grained Finger-level Hand Motion Captioning (FingerCap), which aims to generate textual descriptions that capture detailed finger-level semantics of hand actions. To support this task, we curate FingerCap-40K, a large-scale corpus of 40K paired hand-motion videos and captions spanning two complementary sources: concise instruction-style finger motions and diverse, naturalistic hand-object interactions. To enable effective evaluation, we employ HandJudge, an LLM-based rubric that measures finger-level correctness and motion completeness. Temporal sparsity remains a fundamental bottleneck for current Video-MLLMs, since sparse RGB sampling is insufficient to capture the subtle, high-frequency dynamics underlying fine finger motions. As a simple and compute-friendly remedy, we introduce FiGOP (Finger Group-of-Pictures), which pairs each RGB keyframe with subsequent hand keypoints until the next keyframe. A lightweight temporal encoder converts the keypoints into motion embeddings and integrates them with RGB features. FiGOP adapts the classic GOP concept to finger motion, recovering fine temporal cues without increasing RGB density. Experiments on FingerCap-40K show that strong open- and closed-source Video-MLLMs still struggle with finger-level reasoning, while our FiGOP-augmented model yields consistent gains under HandJudge and human studies.
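The FiGOP grouping described in the abstract can be pictured with a small index-level sketch. This is not the authors' implementation: the names `FiGOPUnit`, `build_figop_units`, and `keyframe_stride` are hypothetical, keyframes are assumed to fall at a fixed stride, and each unit is assumed to hold keypoints only for the frames strictly after its keyframe and before the next one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FiGOPUnit:
    """One illustrative group: an RGB keyframe plus keypoint-only frames."""
    keyframe: int               # index of the frame kept as RGB
    keypoint_frames: List[int]  # indices of frames represented by hand keypoints


def build_figop_units(num_frames: int, keyframe_stride: int) -> List[FiGOPUnit]:
    """Split a clip into groups, assuming keyframes at a fixed stride.

    Hypothetical helper: the abstract does not state how keyframes are chosen.
    """
    if num_frames < 0 or keyframe_stride < 1:
        raise ValueError("num_frames must be >= 0 and keyframe_stride >= 1")
    units = []
    for start in range(0, num_frames, keyframe_stride):
        # Frames after the keyframe, up to (not including) the next keyframe.
        end = min(start + keyframe_stride, num_frames)
        units.append(FiGOPUnit(start, list(range(start + 1, end))))
    return units


if __name__ == "__main__":
    for unit in build_figop_units(num_frames=10, keyframe_stride=4):
        print(unit.keyframe, unit.keypoint_frames)
```

Under these assumptions only one frame per group is processed as RGB, while the remaining frames contribute keypoints alone, which is the sense in which temporal detail is recovered without denser RGB sampling.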
Related papers
- FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion [49.026972478098266]
Hands are central to interacting with our surroundings and conveying gestures. Existing human motion synthesis methods fall short. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion.
arXiv Detail & Related papers (2026-01-07T14:18:59Z)
- OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction [93.88239833545623]
We present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset. We show that tactile signals provide a compact yet powerful cue for grasp understanding. We aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
arXiv Detail & Related papers (2025-12-18T18:18:17Z)
- MILE: A Mechanically Isomorphic Exoskeleton Data Collection System with Fingertip Visuotactile Sensing for Dexterous Manipulation [17.138615434309575]
Existing data-collection pipelines suffer from inaccurate motion manipulation, low data-collection efficiency, and missing high-resolution tactile sensing. We address this gap with MILE, a mechanically isomorphic teleoperation and data-collection system co-designed from human hand to robotic hand.
arXiv Detail & Related papers (2025-11-29T05:34:39Z)
- HandReader: Advanced Techniques for Efficient Fingerspelling Recognition [75.38606213726906]
This paper introduces HandReader, a group of three architectures designed to address the fingerspelling recognition task. HandReader$_{RGB}$ employs the novel Temporal Shift-Adaptive Module (TSAM) to process RGB features from videos of varying lengths. HandReader$_{KP}$ is built on the proposed Temporal Pose Encoder (TPE), which operates on keypoints as tensors. Each HandReader model possesses distinct advantages and achieves state-of-the-art results on the ChicagoFSWild and ChicagoFSWild+ datasets.
arXiv Detail & Related papers (2025-05-15T13:18:37Z)
- Body-Hand Modality Expertized Networks with Cross-attention for Fine-grained Skeleton Action Recognition [28.174638880324014]
BHaRNet is a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information.
arXiv Detail & Related papers (2025-03-19T07:54:52Z)
- Expressive Gaussian Human Avatars from Monocular RGB Video [69.56388194249942]
We introduce EVA, a drivable human model that meticulously sculpts fine details based on 3D Gaussians and SMPL-X.
We highlight the critical importance of aligning the SMPL-X model with RGB frames for effective avatar learning.
We propose a context-aware adaptive density control strategy, which adaptively adjusts the gradient thresholds.
arXiv Detail & Related papers (2024-07-03T15:36:27Z)
- Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation [59.3035531612715]
Existing methods often struggle to generate plausible hand poses when the hand is heavily occluded or blurred.
In videos, the movements of the hand allow us to observe various parts of the hand that may be occluded or blurred in a single frame.
We propose the Deformer: a framework that implicitly reasons about the relationship between hand parts within the same image.
arXiv Detail & Related papers (2023-03-09T02:24:30Z)
- On-device Real-time Hand Gesture Recognition [1.4658400971135652]
We present an on-device real-time hand gesture recognition (HGR) system, which detects a set of predefined static gestures from a single RGB camera.
We use MediaPipe Hands as the basis of the hand skeleton tracker, improve the keypoint accuracy, and add the estimation of 3D keypoints in a world metric space.
arXiv Detail & Related papers (2021-10-29T18:33:25Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- A deep-learning-based multimodal depth-aware dynamic hand gesture recognition system [5.458813674116228]
We focus on dynamic hand gesture (DHG) recognition using depth-quantized image features and hand skeleton joint points.
In particular, we explore the effect of using depth-quantized features in CNN and Recurrent Neural Network (RNN) based multi-modal fusion networks.
arXiv Detail & Related papers (2021-07-06T11:18:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.