Related papers: DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions

URL: http://arxiv.org/abs/2512.02727v1
Date: Tue, 02 Dec 2025 13:01:04 GMT
Title: DF-Mamba: Deformable State Space Modeling for 3D Hand Pose Estimation in Interactions
Authors: Yifan Zhou, Takehiko Ohkawa, Guwenxiao Zhou, Kanoko Goto, Takumi Hirose, Yusuke Sekikawa, Nakamasa Inoue
Abstract summary: Deformable Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.
Score: 30.46272629126919
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modeling daily hand interactions often struggles with severe occlusions, such as when two hands overlap, which highlights the need for robust feature learning in 3D hand pose estimation (HPE). To handle such occluded hand images, it is vital to effectively learn the relationship between local image features (e.g., for occluded joints) and global context (e.g., cues from inter-joints, inter-hands, or the scene). However, most current 3D HPE methods still rely on ResNet for feature extraction, and such CNN's inductive bias may not be optimal for 3D HPE due to its limited capability to model the global context. To address this limitation, we propose an effective and efficient framework for visual feature extraction in 3D HPE using recent state space modeling (i.e., Mamba), dubbed Deformable Mamba (DF-Mamba). DF-Mamba is designed to capture global context cues beyond standard convolution through Mamba's selective state modeling and the proposed deformable state scanning. Specifically, for local features after convolution, our deformable scanning aggregates these features within an image while selectively preserving useful cues that represent the global context. This approach significantly improves the accuracy of structured 3D HPE, with comparable inference speed to ResNet-50. Our experiments involve extensive evaluations on five divergent datasets including single-hand and two-hand scenarios, hand-only and hand-object interactions, as well as RGB and depth-based estimation. DF-Mamba outperforms the latest image backbones, including VMamba and Spatial-Mamba, on all datasets and achieves state-of-the-art performance.

Related papers

HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation [60.2305990057581]
3D hand pose estimation is crucial for many human-computer interaction applications such as augmented reality. HandMCM is a novel method based on the powerful state space model (Mamba).
arXiv Detail & Related papers (2026-02-02T03:25:43Z)
One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation [32.45730375971019]
Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. We demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation.
arXiv Detail & Related papers (2025-09-09T17:59:02Z)
VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field. The predominance of large-portion, homogeneous but useless oceanic backgrounds can dilute the feature representation responses of sparse yet valuable targets. We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE). Our framework sets a new state-of-the-art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
HGMamba: Enhancing 3D Human Pose Estimation with a HyperGCN-Mamba Network [0.0]
3D human pose is a promising research area that leverages estimated and ground-truth 2D human pose data for training. Existing approaches aim to enhance the performance of estimated 2D poses, but struggle when applied to ground-truth 2D pose data. We propose a novel Hyper-GCN and Shuffle Mamba block, which processes input data through two parallel streams.
arXiv Detail & Related papers (2025-04-09T07:28:19Z)
WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild [53.288327629960364]
We present a data-driven pipeline for efficient multi-hand reconstruction in the wild. The proposed pipeline is composed of two components: a real-time fully convolutional hand localization and a high-fidelity transformer-based 3D hand reconstruction model. Our approach outperforms previous methods in both efficiency and accuracy on popular 2D and 3D benchmarks.
arXiv Detail & Related papers (2024-09-18T18:46:51Z)
UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues [55.69339788566899]
UPose3D is a novel approach for multi-view 3D human pose estimation. It improves robustness and flexibility without requiring direct 3D annotations.
arXiv Detail & Related papers (2024-04-23T00:18:00Z)
HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions [68.28684509445529]
We present HandBooster, a new approach to uplift the data diversity and boost the 3D hand-mesh reconstruction performance. First, we construct versatile content-aware conditions to guide a diffusion model to produce realistic images with diverse hand appearances, poses, views, and backgrounds. Then, we design a novel condition creator based on our similarity-aware distribution sampling strategies to deliberately find novel and realistic interaction poses that are distinctive from the training set.
arXiv Detail & Related papers (2024-03-27T13:56:08Z)
Denoising Diffusion for 3D Hand Pose Estimation from Images [38.20064386142944]
This paper addresses the problem of 3D hand pose estimation from monocular images or sequences. We present a novel end-to-end framework for 3D hand regression that employs diffusion models that have shown excellent ability to capture the distribution of data for generative purposes. The proposed model provides state-of-the-art performance when lifting a 2D single-hand image to 3D.
arXiv Detail & Related papers (2023-08-18T12:57:22Z)
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation [72.94143731623117]
Existing methods simply align 3D representations with single-view 2D images and coarse-grained parent category text. Insufficient synergy neglects the idea that a robust 3D representation should align with the joint vision-language space. We propose a multi-view joint modality modeling approach, termed JM3D, to obtain a unified representation for point cloud, text, and image.
arXiv Detail & Related papers (2023-08-06T01:11:40Z)
HandFoldingNet: A 3D Hand Pose Estimation Network Using Multiscale-Feature Guided Folding of a 2D Hand Skeleton [4.1954750695245835]
This paper proposes HandFoldingNet, an accurate and efficient hand pose estimator. The proposed model utilizes a folding-based decoder that folds a given 2D hand skeleton into the corresponding joint coordinates. Experimental results show that the proposed model outperforms the existing methods on three hand pose benchmark datasets.
arXiv Detail & Related papers (2021-08-12T05:52:44Z)
Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image. The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images. We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)

This list is automatically generated from the titles and abstracts of the papers on this site.