MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image
- URL: http://arxiv.org/abs/2112.02753v1
- Date: Mon, 6 Dec 2021 03:01:24 GMT
- Title: MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image
- Authors: Xingyu Chen, Yufeng Liu, Yajiao Dong, Xiong Zhang, Chongyang Ma,
Yanmin Xiong, Yuan Zhang, and Xiaoyan Guo
- Abstract summary: We propose a framework for single-view hand mesh reconstruction that simultaneously achieves high reconstruction accuracy, fast inference speed, and temporal coherence.
Our framework, called MobRecon, has affordable computational cost and a miniature model size, reaching a high inference speed of 83 FPS on an Apple A14 CPU.
- Score: 18.68544438724187
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we propose a framework for single-view hand mesh
reconstruction, which can simultaneously achieve high reconstruction accuracy,
fast inference speed, and temporal coherence. Specifically, for 2D encoding, we
propose lightweight yet effective stacked structures. Regarding 3D decoding, we
provide an efficient graph operator, namely depth-separable spiral convolution.
Moreover, we present a novel feature lifting module for bridging the gap
between 2D and 3D representations. This module starts with a map-based position
regression (MapReg) block to integrate the merits of both heatmap encoding and
position regression paradigms to improve 2D accuracy and temporal coherence.
Furthermore, MapReg is followed by pose pooling and pose-to-vertex lifting
approaches, which transform 2D pose encodings to semantic features of 3D
vertices. Overall, our hand reconstruction framework, called MobRecon, has
affordable computational cost and a miniature model size, reaching a high
inference speed of 83 FPS on an Apple A14 CPU. Extensive experiments
on popular datasets such as FreiHAND, RHD, and HO3Dv2 demonstrate that our
MobRecon achieves superior performance on reconstruction accuracy and temporal
coherence. Our code is publicly available at
https://github.com/SeanChenxy/HandMesh.
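To make the three named components more concrete, below is a minimal PyTorch sketch of (i) a depth-separable spiral convolution, (ii) a MapReg-style readout, and (iii) pose-to-vertex lifting. All class and function names, tensor shapes, and the soft-argmax formulation are illustrative assumptions, not the authors' implementation; the linked repository contains the real code.

```python
import torch
import torch.nn as nn


class DepthSeparableSpiralConv(nn.Module):
    """Sketch of a depth-separable spiral convolution on a mesh.

    `spiral_indices` is assumed to be a (V, K) LongTensor holding, for
    each of V vertices, the indices of K neighbors along a precomputed
    spiral sequence (as in SpiralNet-style operators)."""

    def __init__(self, in_channels, out_channels, spiral_indices):
        super().__init__()
        self.register_buffer("spiral", spiral_indices)        # (V, K)
        k = spiral_indices.size(1)
        # Depthwise step: one weight per (channel, spiral position).
        self.depthwise = nn.Parameter(torch.randn(in_channels, k) * 0.1)
        # Pointwise step: 1x1 linear layer mixing channels.
        self.pointwise = nn.Linear(in_channels, out_channels)

    def forward(self, x):
        # x: (B, V, C) per-vertex features.
        B, V, C = x.shape
        K = self.spiral.size(1)
        # Gather spiral neighborhoods -> (B, V, K, C).
        neigh = x[:, self.spiral.reshape(-1), :].reshape(B, V, K, C)
        # Aggregate along the spiral per channel (depthwise)...
        out = torch.einsum("bvkc,ck->bvc", neigh, self.depthwise)
        # ...then mix channels (pointwise).
        return self.pointwise(out)


def soft_argmax_2d(heatmaps):
    """Stand-in for a MapReg-style readout: a differentiable soft-argmax
    that combines heatmap encoding with position regression. This is a
    common hybrid, not necessarily the paper's exact block."""
    B, J, H, W = heatmaps.shape
    probs = heatmaps.flatten(2).softmax(dim=-1).reshape(B, J, H, W)
    ys = torch.linspace(0.0, 1.0, H, device=heatmaps.device)
    xs = torch.linspace(0.0, 1.0, W, device=heatmaps.device)
    y = (probs.sum(dim=3) * ys).sum(dim=2)                    # (B, J)
    x = (probs.sum(dim=2) * xs).sum(dim=2)                    # (B, J)
    return torch.stack([x, y], dim=-1)                        # (B, J, 2)


class PoseToVertexLifting(nn.Module):
    """Hypothetical lifting: map J joint features to V vertex features
    with a learned linear map, standing in for pose-to-vertex lifting."""

    def __init__(self, num_joints, num_vertices):
        super().__init__()
        self.lift = nn.Parameter(torch.randn(num_vertices, num_joints) * 0.1)

    def forward(self, joint_feats):
        # joint_feats: (B, J, C) -> (B, V, C) vertex features.
        return torch.einsum("vj,bjc->bvc", self.lift, joint_feats)
```

The depthwise/pointwise split mirrors depthwise-separable 2D convolution: per-channel aggregation along the spiral sequence is cheap, and the pointwise linear layer restores cross-channel mixing, which is where the operator's efficiency comes from.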
Related papers
- GSFusion: Online RGB-D Mapping Where Gaussian Splatting Meets TSDF Fusion [12.964675001994124]
Traditional fusion algorithms preserve the spatial structure of 3D scenes but often lack realism in visualization.
GSFusion significantly enhances computational efficiency without sacrificing rendering quality.
arXiv Detail & Related papers (2024-08-22T18:32:50Z) - SfM on-the-fly: Get better 3D from What You Capture [24.141351494527303]
Structure from Motion (SfM) has been a constant research hotspot in the fields of photogrammetry, computer vision, and robotics.
This work builds upon the original on-the-fly SfM and presents an updated version with three new advancements to get better 3D from what you capture.
arXiv Detail & Related papers (2024-07-04T13:52:37Z) - Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction [153.52406455209538]
Gamba is an end-to-end 3D reconstruction model that works from a single-view image.
It completes reconstruction within 0.05 seconds on a single NVIDIA A100 GPU.
arXiv Detail & Related papers (2024-03-27T17:40:14Z) - Splatter Image: Ultra-Fast Single-View 3D Reconstruction [67.96212093828179]
Splatter Image is based on Gaussian Splatting, which allows fast and high-quality reconstruction of 3D scenes from multiple images.
We learn a neural network that, at test time, performs reconstruction in a feed-forward manner, at 38 FPS.
On several synthetic, real, multi-category and large-scale benchmark datasets, we achieve better results in terms of PSNR, LPIPS, and other metrics while training and evaluating much faster than prior works.
arXiv Detail & Related papers (2023-12-20T16:14:58Z) - Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D
Reconstruction with Transformers [37.14235383028582]
We introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference.
Our method utilizes two transformer-based networks, namely a point decoder and a triplane decoder, to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation.
arXiv Detail & Related papers (2023-12-14T17:18:34Z) - SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and
Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality among 2D-3D network-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z) - XFormer: Fast and Accurate Monocular 3D Body Capture [29.36334648136584]
We present XFormer, a novel human mesh and motion capture method that achieves real-time performance on consumer CPUs given only monocular images as input.
XFormer runs blazingly fast (over 30 fps on a single CPU core) while still yielding competitive accuracy.
With an HRNet backbone, XFormer delivers state-of-the-art performance on the Human3.6M and 3DPW datasets.
arXiv Detail & Related papers (2023-05-18T16:45:26Z) - CheckerPose: Progressive Dense Keypoint Localization for Object Pose
Estimation with Graph Neural Network [66.24726878647543]
Estimating the 6-DoF pose of a rigid object from a single RGB image is a crucial yet challenging task.
Recent studies have shown the great potential of dense correspondence-based solutions.
We propose a novel pose estimation algorithm named CheckerPose, which improves on three main aspects.
arXiv Detail & Related papers (2023-03-29T17:30:53Z) - Multi-initialization Optimization Network for Accurate 3D Human Pose and
Shape Estimation [75.44912541912252]
We propose a three-stage framework named Multi-Initialization Optimization Network (MION); a minimal pipeline sketch follows this list.
In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
arXiv Detail & Related papers (2021-12-24T02:43:58Z) - Improved Modeling of 3D Shapes with Multi-view Depth Maps [48.8309897766904]
We present a general-purpose framework for modeling 3D shapes using CNNs.
Using just a single depth image of the object, we can output a dense multi-view depth map representation of 3D objects.
arXiv Detail & Related papers (2020-09-07T17:58:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.