LWA-HAND: Lightweight Attention Hand for Interacting Hand Reconstruction
- URL: http://arxiv.org/abs/2208.09815v2
- Date: Tue, 23 Aug 2022 03:54:47 GMT
- Title: LWA-HAND: Lightweight Attention Hand for Interacting Hand Reconstruction
- Authors: Xinhan Di, Pengqian Yu
- Abstract summary: We propose a method called lightweight attention hand (LWA-HAND) to reconstruct hands with low FLOPs from a single RGB image.
The resulting model achieves performance comparable to state-of-the-art models on the InterHand2.6M benchmark.
- Score: 2.2481284426718533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hand reconstruction has achieved great success in real-time applications such
as virtual reality and augmented reality, while reconstruction of two interacting
hands with efficient transformers remains unexplored. In this paper, we propose a
method called lightweight attention hand (LWA-HAND) to reconstruct hands with low
FLOPs from a single RGB image. To address the occlusion and interaction challenges
in efficient attention architectures, we introduce three mobile attention modules.
The first is a lightweight feature attention module that extracts both a local
occlusion representation and a global image patch representation in a
coarse-to-fine manner. The second is a cross image and graph bridge module that
fuses image context with hand vertex features. The third is a lightweight
cross-attention mechanism that uses element-wise operations to perform
cross-attention between the two hands in linear complexity. The resulting model
achieves performance comparable to state-of-the-art models on the InterHand2.6M
benchmark while reducing the computation to 0.47 GFLOPs, whereas the
state-of-the-art models require between 10 and 20 GFLOPs.
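To make the linear-complexity claim concrete, here is a minimal PyTorch sketch of kernel-factorized cross-attention between two hands' token sets, computing phi(Q)(phi(K)^T V) instead of the quadratic softmax(QK^T)V. This is our own illustration under standard linear-attention assumptions, not the authors' module; the class name, the ELU+1 feature map, and the 778-token (MANO vertex) count are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearCrossAttention(nn.Module):
    """Kernel-factorized cross-attention in O(N * D^2) time (hypothetical
    sketch): phi(Q) (phi(K)^T V) avoids forming the N x M attention matrix."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)

    def forward(self, x_a, x_b):
        # x_a: (B, N, D) tokens of one hand; x_b: (B, M, D) tokens of the other.
        q = F.elu(self.to_q(x_a)) + 1.0            # positive feature map phi(Q)
        k = F.elu(self.to_k(x_b)) + 1.0            # positive feature map phi(K)
        v = self.to_v(x_b)
        kv = torch.einsum("bmd,bme->bde", k, v)    # phi(K)^T V, linear in M
        z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("bnd,bde,bn->bne", q, kv, z)

# Usage: fuse right-hand vertex tokens with left-hand context (778 = MANO vertices).
attn = LinearCrossAttention(dim=64)
out = attn(torch.randn(2, 778, 64), torch.randn(2, 778, 64))   # -> (2, 778, 64)
```

The einsum contraction over the shared feature dimension keeps the cost linear in the number of tokens, which is the property the abstract attributes to its third module.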
Related papers
- VM-BHINet: Vision Mamba Bimanual Hand Interaction Network for 3D Interacting Hand Mesh Recovery From a Single RGB Image [13.009696075460521]
Vision Mamba Bimanual Hand Interaction Network (VM-BHINet) introduces state space models (SSMs) into hand reconstruction to enhance interaction modeling.
The core component, Vision Mamba Interaction Feature Extraction Block (VM-IFEBlock), combines SSMs with local and global feature operations.
Experiments on the InterHand2.6M dataset show that VM-BHINet reduces Mean per-joint position error (MPJPE) and Mean per-vertex position error (MPVPE) by 2-3%.
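For intuition about what SSM blocks compute, the toy scan below (a hypothetical, heavily simplified stand-in, not the VM-IFEBlock) applies a diagonal linear state-space recurrence to a token sequence in time linear in its length.

```python
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Toy diagonal state-space scan (not the VM-IFEBlock): per channel,
    h_t = a * h_{t-1} + b * x_t and y_t = sum(c * h_t), costing O(L) in
    sequence length instead of attention's O(L^2)."""
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.a_logit = nn.Parameter(torch.randn(dim, state))
        self.b = nn.Parameter(0.1 * torch.randn(dim, state))
        self.c = nn.Parameter(0.1 * torch.randn(dim, state))

    def forward(self, x):
        # x: (B, L, D) sequence of image or vertex tokens.
        a = torch.sigmoid(self.a_logit)            # |a| < 1 keeps the scan stable
        h = x.new_zeros(x.shape[0], x.shape[2], self.b.shape[1])
        ys = []
        for t in range(x.shape[1]):                # explicit O(L) recurrence
            h = a * h + self.b * x[:, t, :, None]
            ys.append((h * self.c).sum(dim=-1))
        return torch.stack(ys, dim=1)              # (B, L, D)

y = DiagonalSSM(dim=32)(torch.randn(2, 100, 32))   # -> (2, 100, 32)
```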
arXiv Detail & Related papers (2025-04-20T13:54:22Z)
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.
We propose a Self-supervised Transfer (PST) strategy and a Frequency-Decoupled Fusion module (FreDF).
PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models.
FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
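A minimal sketch of the frequency-decoupling idea, assuming FFT-domain masking (our own illustration; FreDF's actual module is more involved): keep low-frequency structure from one modality and high-frequency detail from the other.

```python
import torch

def frequency_decoupled_fuse(img_feat, evt_feat, cutoff=0.25):
    """Illustrative FFT-domain fusion (not the FreDF module itself): take
    low-frequency structure from the image branch and high-frequency detail
    from the event branch. Features: (B, C, H, W)."""
    img_f = torch.fft.fft2(img_feat)
    evt_f = torch.fft.fft2(evt_feat)
    H, W = img_feat.shape[-2:]
    fy = torch.fft.fftfreq(H, device=img_feat.device)[:, None]
    fx = torch.fft.fftfreq(W, device=img_feat.device)[None, :]
    low = ((fy ** 2 + fx ** 2).sqrt() <= cutoff).to(img_f.dtype)  # radial low-pass
    fused = img_f * low + evt_f * (1.0 - low)
    return torch.fft.ifft2(fused).real

out = frequency_decoupled_fuse(torch.randn(1, 8, 64, 64), torch.randn(1, 8, 64, 64))
```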
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
- Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba [48.45301469664908]
3D hand reconstruction from a single RGB image is challenging due to articulated motion, self-occlusion, and interaction with objects.
Existing SOTA methods employ attention-based transformers to learn the 3D hand pose and shape.
We propose a novel graph-guided Mamba framework, named Hamba, which bridges graph learning and state space modeling.
arXiv Detail & Related papers (2024-07-12T19:04:58Z)
- 3D Pose Estimation of Two Interacting Hands from a Monocular Event Camera [59.846927201816776]
This paper introduces the first framework for 3D tracking of two fast-moving and interacting hands from a single monocular event camera.
Our approach tackles the left-right hand ambiguity with a novel semi-supervised feature-wise attention mechanism and integrates an intersection loss to penalize hand collisions.
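As a rough illustration of what an intersection loss can look like (our own sphere-proxy simplification, not the paper's formulation), the sketch below penalizes left/right vertex pairs that come closer than a contact threshold.

```python
import torch

def sphere_intersection_loss(verts_l, verts_r, radius=0.004):
    """Hypothetical collision penalty (not the paper's loss): treat each
    vertex as a small sphere and penalize left/right vertex pairs closer
    than 2 * radius. Real methods typically use mesh signed distances."""
    # verts_*: (B, V, 3) hand vertices in meters.
    d = torch.cdist(verts_l, verts_r)              # (B, V, V) pairwise distances
    return (2.0 * radius - d).clamp(min=0.0).mean()

loss = sphere_intersection_loss(torch.rand(1, 778, 3), torch.rand(1, 778, 3))
```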
arXiv Detail & Related papers (2023-12-21T18:59:57Z)
- Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing [54.168567276280505]
We propose a novel Mutual Information-driven Triple interaction Network (MITNet) for image dehazing.
The first stage, named amplitude-guided haze removal, aims to recover the amplitude spectrum of the hazy images for haze removal.
The second stage, named phase-guided structure refinement, is devoted to learning the transformation and refinement of the phase spectrum.
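A minimal sketch of the underlying amplitude/phase split (illustrative only, not MITNet's network): replace the hazy amplitude spectrum with a restored one while keeping the phase, which carries scene structure.

```python
import torch

def swap_amplitude(hazy, restored_amp):
    """Sketch of the amplitude/phase split (illustrative, not MITNet's code):
    haze mainly corrupts the amplitude spectrum, so combine a restored
    amplitude with the hazy image's own phase, which encodes structure."""
    f = torch.fft.fft2(hazy)                       # (B, C, H, W) complex spectrum
    phase = torch.angle(f)
    return torch.fft.ifft2(restored_amp * torch.exp(1j * phase)).real

hazy = torch.rand(1, 3, 64, 64)
amp = torch.fft.fft2(hazy).abs()                   # stand-in for a predicted amplitude
dehazed = swap_amplitude(hazy, amp)
```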
arXiv Detail & Related papers (2023-08-14T08:23:58Z)
- Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes [58.551154822792284]
Implicit Two Hands (Im2Hands) is the first neural implicit representation of two interacting hands.
Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency.
We experimentally demonstrate the effectiveness of Im2Hands on two-hand reconstruction in comparison to related methods.
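For readers unfamiliar with neural implicit shape representations, the toy occupancy network below (far simpler than Im2Hands, with hypothetical names and sizes) shows the basic query pattern: a 3D point plus a latent hand code in, an inside/outside probability out.

```python
import torch
import torch.nn as nn

class HandOccupancy(nn.Module):
    """Toy neural implicit shape (far simpler than Im2Hands): an MLP maps a
    3D query point plus a latent hand code to the probability that the
    point lies inside the hand surface. All names and sizes are assumptions."""
    def __init__(self, latent: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pts, code):
        # pts: (B, N, 3) query points; code: (B, latent) per-hand shape code.
        z = code[:, None, :].expand(-1, pts.shape[1], -1)
        return torch.sigmoid(self.mlp(torch.cat([pts, z], dim=-1)))   # (B, N, 1)

occ = HandOccupancy()(torch.rand(1, 1024, 3), torch.randn(1, 128))
```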
arXiv Detail & Related papers (2023-02-28T06:38:25Z)
- Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from a Single RGB Image [30.24438569170251]
We propose a decoupled iterative refinement framework to achieve pixel-aligned hand reconstruction.
Our method outperforms all existing two-hand reconstruction methods by a large margin on the InterHand2.6M dataset.
arXiv Detail & Related papers (2023-02-05T15:46:57Z)
- Interacting Attention Graph for Single Image Two-Hand Reconstruction [32.342152070402236]
We present Interacting Attention Graph Hand (IntagHand), the first graph convolution based network that reconstructs two interacting hands from a single RGB image.
Our model outperforms all existing two-hand reconstruction methods by a large margin on InterHand2.6M benchmark.
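A minimal graph-convolution layer over mesh vertices (a generic sketch under standard GCN assumptions, not IntagHand's blocks) illustrates how vertex features are propagated along mesh topology.

```python
import torch
import torch.nn as nn

class MeshGCNLayer(nn.Module):
    """Minimal graph convolution over hand-mesh vertices (a generic sketch,
    not IntagHand's blocks): each vertex aggregates neighbor features
    through a row-normalized adjacency built from the mesh topology."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (B, V, C) vertex features; adj: (V, V) normalized adjacency
        # with self-loops (here a placeholder identity for runnability).
        return torch.relu(self.lin(adj @ x))

V = 778                                            # MANO hand mesh vertex count
out = MeshGCNLayer(64, 64)(torch.randn(2, V, 64), torch.eye(V))
```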
arXiv Detail & Related papers (2022-03-17T14:51:11Z)
- MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image [18.68544438724187]
We propose a framework for single-view hand mesh reconstruction, which can simultaneously achieve high reconstruction accuracy, fast inference speed, and temporal coherence.
Our framework, called MobRecon, combines affordable computational cost with a miniature model size, reaching a high inference speed of 83 FPS on an Apple A14 CPU.
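As one example of the kind of operator mobile-friendly hand networks lean on (an assumption for illustration; not necessarily MobRecon's exact layer), a depthwise separable convolution cuts a standard convolution's multiply-adds by roughly a factor of 1/c_out + 1/k^2.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Generic mobile building block (an assumption for illustration, not
    necessarily MobRecon's layer): a k x k depthwise conv plus a 1 x 1
    pointwise conv costs roughly (1/c_out + 1/k^2) of a standard conv."""
    def __init__(self, c_in: int, c_out: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 56, 56))  # -> (1, 64, 56, 56)
```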
arXiv Detail & Related papers (2021-12-06T03:01:24Z)
- Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements [96.40125818594952]
We make the first attempt to reconstruct 3D interacting hands from monocular single RGB images.
Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions.
arXiv Detail & Related papers (2021-11-01T08:24:10Z)
- Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera [79.41374930171469]
We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands.
Our approach combines an extensive list of favorable properties; among other things, it is marker-less.
We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work.
arXiv Detail & Related papers (2021-06-15T11:39:49Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information (including all listed content) and is not responsible for any consequences arising from its use.