Fusing Monocular Images and Sparse IMU Signals for Real-time Human
Motion Capture
- URL: http://arxiv.org/abs/2309.00310v1
- Date: Fri, 1 Sep 2023 07:52:08 GMT
- Title: Fusing Monocular Images and Sparse IMU Signals for Real-time Human
Motion Capture
- Authors: Shaohua Pan, Qi Ma, Xinyu Yi, Weifeng Hu, Xiong Wang, Xingkang Zhou,
Jijunnan Li, and Feng Xu
- Abstract summary: We propose a method that fuses monocular images and sparse IMUs for real-time human motion capture.
Our method contains a dual coordinate strategy to fully explore the IMU signals with different goals in motion capture.
Our technique significantly outperforms the state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation.
- Score: 8.125716139367142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Either RGB images or inertial signals have been used for the task of motion
capture (mocap), but combining them together is a new and interesting topic. We
believe that the combination is complementary and able to solve the inherent
difficulties of using one modality input, including occlusions, extreme
lighting/texture, and out-of-view for visual mocap and global drifts for
inertial mocap. To this end, we propose a method that fuses monocular images
and sparse IMUs for real-time human motion capture. Our method contains a dual
coordinate strategy to fully explore the IMU signals with different goals in
motion capture. To be specific, besides one branch transforming the IMU signals
to the camera coordinate system to combine with the image information, there is
another branch to learn from the IMU signals in the body root coordinate system
to better estimate body poses. Furthermore, a hidden state feedback mechanism
is proposed for both two branches to compensate for their own drawbacks in
extreme input cases. Thus our method can easily switch between the two kinds of
signals or combine them in different cases to achieve a robust mocap.
Quantitative and qualitative results demonstrate that by delicately
designing the fusion method, our technique significantly outperforms the
state-of-the-art vision, IMU, and combined methods on both global orientation
and local pose estimation. Our codes are available for research at
https://shaohua-pan.github.io/robustcap-page/.
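As a rough illustration of the dual coordinate strategy and the hidden state feedback mechanism described in the abstract, the sketch below pairs a camera-frame branch (image features fused with IMU signals transformed into the camera coordinate system) with a body-root branch (IMU signals in the root coordinate system only), and exchanges their recurrent hidden states between frames. This is a minimal, hypothetical PyTorch rendering; the module name, feature dimensions, and fusion details (DualCoordinateFusion, img_dim, imu_dim, pose_dim) are assumptions for illustration, not the authors' released code (see the project page above).

```python
import torch
import torch.nn as nn


class DualCoordinateFusion(nn.Module):
    """Sketch of a dual-coordinate mocap network: one recurrent branch fuses
    image features with camera-frame IMU signals, the other uses body-root
    frame IMU signals alone, and their hidden states feed back into each
    other so either branch can compensate when the other degrades."""

    def __init__(self, img_dim=256, imu_dim=72, hidden=256, pose_dim=144):
        super().__init__()
        # Camera-frame branch: image features + IMU rotated into the camera
        # coordinate system, mainly driving global orientation cues.
        self.cam_rnn = nn.GRU(img_dim + imu_dim, hidden, batch_first=True)
        # Body-root branch: IMU in the root coordinate system, used on its
        # own to estimate local body pose (robust to occlusion/out-of-view).
        self.root_rnn = nn.GRU(imu_dim, hidden, batch_first=True)
        self.pose_head = nn.Linear(2 * hidden, pose_dim)
        # Hidden-state feedback: each branch refreshes its state from the
        # other before the next chunk of frames is processed.
        self.cam_feedback = nn.Linear(hidden, hidden)
        self.root_feedback = nn.Linear(hidden, hidden)

    def forward(self, img_feat, imu_cam, imu_root, h_cam=None, h_root=None):
        # img_feat: (B, T, img_dim)   per-frame image features
        # imu_cam:  (B, T, imu_dim)   IMU signals in the camera frame
        # imu_root: (B, T, imu_dim)   IMU signals in the body-root frame
        cam_in = torch.cat([img_feat, imu_cam], dim=-1)
        cam_out, h_cam = self.cam_rnn(cam_in, h_cam)
        root_out, h_root = self.root_rnn(imu_root, h_root)
        pose = self.pose_head(torch.cat([cam_out, root_out], dim=-1))
        # Exchange information between the two hidden states.
        h_cam_next = h_cam + self.root_feedback(h_root)
        h_root_next = h_root + self.cam_feedback(h_cam)
        return pose, h_cam_next, h_root_next


if __name__ == "__main__":
    model = DualCoordinateFusion()
    B, T = 2, 30
    pose, h_cam, h_root = model(
        torch.randn(B, T, 256), torch.randn(B, T, 72), torch.randn(B, T, 72)
    )
    print(pose.shape)  # torch.Size([2, 30, 144])
```

The returned hidden states would be carried over to the next chunk of frames, which is one plausible way the method could "switch between the two kinds of signals or combine them" at run time; the actual feedback design may differ from this additive sketch.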
Related papers
- Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition [24.217068565936117]
We present a novel method for action recognition that integrates motion data from body-worn IMUs with egocentric video.
To model the complex relations among the multiple IMU devices placed across the body, we exploit their collaborative dynamics.
Experiments show our method can achieve state-of-the-art performance on multiple public datasets.
arXiv Detail & Related papers (2024-07-09T07:53:16Z) - Fusion Transformer with Object Mask Guidance for Image Forgery Analysis [9.468075384561947]
We introduce OMG-Fuser, a fusion transformer-based network designed to extract information from various forensic signals.
Our approach can operate with an arbitrary number of forensic signals and leverages object information for their analysis.
Our model is robust against traditional and novel forgery attacks and can be expanded with new signals without training from scratch.
arXiv Detail & Related papers (2024-03-18T20:20:13Z) - Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z) - EgoLocate: Real-time Motion Capture, Localization, and Mapping with
Sparse Body-mounted Sensors [74.1275051763006]
We develop a system that simultaneously performs human motion capture (mocap), localization, and mapping in real time from sparse body-mounted sensors.
Our technique largely improves over the state of the art of the two fields.
arXiv Detail & Related papers (2023-05-02T16:56:53Z) - DeepMLE: A Robust Deep Maximum Likelihood Estimator for Two-view
Structure from Motion [9.294501649791016]
Two-view structure from motion (SfM) is the cornerstone of 3D reconstruction and visual SLAM (vSLAM).
We formulate the two-view SfM problem as a maximum likelihood estimation (MLE) and solve it with the proposed framework, denoted as DeepMLE.
Our method significantly outperforms the state-of-the-art end-to-end two-view SfM approaches in accuracy and generalization capability.
arXiv Detail & Related papers (2022-10-11T15:07:25Z) - Animation from Blur: Multi-modal Blur Decomposition with Motion Guidance [83.25826307000717]
We study the challenging problem of recovering detailed motion from a single motion-blurred image.
Existing solutions to this problem estimate a single image sequence without considering the motion ambiguity for each region.
In this paper, we explicitly account for such motion ambiguity, allowing us to generate multiple plausible solutions all in sharp detail.
arXiv Detail & Related papers (2022-07-20T18:05:53Z) - EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation in many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z) - AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in
the Wild [77.43884383743872]
We present AdaFuse, an adaptive multiview fusion method to enhance the features in occluded views.
We extensively evaluate the approach on three public datasets including Human3.6M, Total Capture and CMU Panoptic.
We also create a large scale synthetic dataset Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints.
arXiv Detail & Related papers (2020-10-26T03:19:46Z) - Dual Attention GANs for Semantic Image Synthesis [101.36015877815537]
We propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images.
We also propose two novel modules, i.e., a position-wise Spatial Attention Module (SAM) and a scale-wise Channel Attention Module (CAM).
DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.
arXiv Detail & Related papers (2020-08-29T17:49:01Z)