Towards Weather-Robust 3D Human Body Reconstruction: Millimeter-Wave Radar-Based Dataset, Benchmark, and Multi-Modal Fusion
- URL: http://arxiv.org/abs/2409.04851v2
- Date: Wed, 18 Dec 2024 03:40:35 GMT
- Title: Towards Weather-Robust 3D Human Body Reconstruction: Millimeter-Wave Radar-Based Dataset, Benchmark, and Multi-Modal Fusion
- Authors: Anjun Chen, Xiangyu Wang, Kun Shi, Yuchi Huo, Jiming Chen, Qi Ye
- Abstract summary: 3D human reconstruction from RGB images achieves decent results in good weather conditions but degrades dramatically in rough weather. mmWave radars have been employed to reconstruct 3D human joints and meshes in rough weather. We design ImmFusion, the first mmWave-RGB fusion solution to robustly reconstruct 3D human bodies in various weather conditions.
- Score: 13.082760040398147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D human reconstruction from RGB images achieves decent results in good weather conditions but degrades dramatically in rough weather. Complementarily, mmWave radars have been employed to reconstruct 3D human joints and meshes in rough weather. However, combining RGB and mmWave signals for weather-robust 3D human reconstruction remains an open challenge, given the sparse nature of mmWave point clouds and the vulnerability of RGB images. The limited research on how missing points and the sparsity of mmWave data affect reconstruction performance, as well as the lack of available datasets of paired mmWave-RGB data, further complicates fusing the two modalities. To fill these gaps, we build an automatic 3D body annotation system with multiple sensors to collect a large-scale mmWave dataset. The dataset consists of synchronized and calibrated mmWave radar point clouds and RGB(D) images under different weather conditions, along with skeleton/mesh annotations for the humans in these scenes. With this dataset, we conduct a comprehensive analysis of the limitations of single-modality reconstruction and of the impact of missing points and sparsity on reconstruction performance. Guided by this analysis, we design ImmFusion, the first mmWave-RGB fusion solution to robustly reconstruct 3D human bodies in various weather conditions. Specifically, ImmFusion consists of image and point backbones for token feature extraction and a Transformer module for token fusion. The image and point backbones refine global and local features from the original data, and the Fusion Transformer Module fuses the two modalities effectively by dynamically selecting informative tokens. Extensive experiments demonstrate that ImmFusion efficiently exploits the information of both modalities to achieve robust 3D human body reconstruction in various weather environments.
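The pipeline described in the abstract (modality-specific backbones producing tokens, followed by a Transformer that fuses them while favoring informative tokens) can be sketched roughly as below. This is a minimal PyTorch illustration under assumed dimensions and module names; the backbones, the sigmoid token gate, and the joint-query head are hypothetical stand-ins, not the authors' ImmFusion implementation.

```python
# Minimal sketch of mmWave-RGB token fusion: each modality's backbone emits
# tokens, a learned gate down-weights uninformative tokens, and a Transformer
# encoder fuses everything together with a set of query tokens.
import torch
import torch.nn as nn


class TokenBackbone(nn.Module):
    """Projects per-modality features (e.g., CNN image patches or
    point-wise radar features) into a shared token space."""
    def __init__(self, in_dim: int, token_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, token_dim), nn.ReLU(),
                                  nn.Linear(token_dim, token_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, N, in_dim)
        return self.proj(feats)                               # (B, N, token_dim)


class FusionTransformer(nn.Module):
    """Fuses RGB and mmWave tokens; a learned gate softly selects
    informative tokens before joint self-attention."""
    def __init__(self, token_dim: int = 256, heads: int = 4, layers: int = 3):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(token_dim, 1), nn.Sigmoid())
        enc_layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(token_dim, 3)  # e.g., 3D coordinates per query

    def forward(self, img_tokens, pts_tokens, query_tokens):
        tokens = torch.cat([img_tokens, pts_tokens], dim=1)
        tokens = tokens * self.gate(tokens)                   # soft token selection
        fused = self.encoder(torch.cat([query_tokens, tokens], dim=1))
        return self.head(fused[:, :query_tokens.shape[1]])    # (B, J, 3)


# Toy usage with random features standing in for backbone outputs.
B, J = 2, 17
img_backbone = TokenBackbone(in_dim=512, token_dim=256)
pts_backbone = TokenBackbone(in_dim=64, token_dim=256)
fusion = FusionTransformer()
img_tokens = img_backbone(torch.randn(B, 49, 512))    # e.g., 7x7 image grid
pts_tokens = pts_backbone(torch.randn(B, 128, 64))    # sparse radar points
queries = torch.randn(B, J, 256)                      # joint queries (toy)
print(fusion(img_tokens, pts_tokens, queries).shape)  # torch.Size([2, 17, 3])
```

The gating step reflects the abstract's "dynamically selecting informative tokens": when rain or fog corrupts the image tokens, their gate values can shrink so that the radar tokens dominate the fused representation, and vice versa.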
Related papers
- Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities [9.785262633953794]
PhysioOmni is a foundation model for multimodal physiological signal analysis.
It trains a decoupled multimodal tokenizer, enabling masked signal pre-training.
It achieves state-of-the-art performance while maintaining strong robustness to missing modalities.
arXiv Detail & Related papers (2025-04-28T09:00:04Z)
- LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition [5.317624228510749]
We propose a novel Lightweight Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to enhance 3D object recognition in robotic applications.
We evaluate our method on the synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6%.
Results consistently show superior performance, demonstrating the method's robustness in 3D object recognition across synthetic and real-world 3D data.
arXiv Detail & Related papers (2025-04-27T14:30:16Z)
- Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes [56.52618054240197]
We propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes.
Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities.
CAFuser sets a new state of the art on the MUSES dataset, with 59.7 PQ for multimodal panoptic segmentation and 78.2 mIoU for semantic segmentation, ranking first on the public benchmarks. A rough sketch of the condition-token idea follows.
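The sketch below illustrates the Condition Token mechanism described above: an RGB-driven classifier predicts the environmental condition, and an embedding of that condition modulates how the sensor features are mixed. All names, shapes, and the weighted-sum fusion are illustrative assumptions, not the CAFuser implementation.

```python
# Condition-aware fusion sketch: classify the scene condition from RGB,
# look up a Condition Token, and derive per-sensor mixing weights from it.
import torch
import torch.nn as nn


class ConditionAwareFusion(nn.Module):
    def __init__(self, feat_dim: int = 128, num_conditions: int = 4, num_sensors: int = 3):
        super().__init__()
        self.condition_head = nn.Linear(feat_dim, num_conditions)  # e.g., clear/rain/fog/night
        self.token_embed = nn.Embedding(num_conditions, feat_dim)  # Condition Token
        self.weight_head = nn.Linear(feat_dim, num_sensors)        # per-sensor mixing weights

    def forward(self, rgb_feat: torch.Tensor, sensor_feats: torch.Tensor):
        # rgb_feat: (B, feat_dim); sensor_feats: (B, num_sensors, feat_dim)
        condition = self.condition_head(rgb_feat).argmax(dim=-1)   # hard choice; toy only
        token = self.token_embed(condition)                        # (B, feat_dim)
        weights = self.weight_head(token).softmax(dim=-1)          # (B, num_sensors)
        fused = (weights.unsqueeze(-1) * sensor_feats).sum(dim=1)  # (B, feat_dim)
        return fused, condition


fusion = ConditionAwareFusion()
fused, cond = fusion(torch.randn(2, 128), torch.randn(2, 3, 128))
print(fused.shape, cond.shape)  # torch.Size([2, 128]) torch.Size([2])
```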
arXiv Detail & Related papers (2024-10-14T17:56:20Z)
- X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing [14.549639729808717]
Current human sensing primarily depends on cameras and LiDAR, each of which has its own strengths and limitations.
Existing multi-modal fusion solutions are typically designed for fixed modality combinations.
We propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue.
arXiv Detail & Related papers (2024-10-14T05:23:12Z)
- Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV).
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
- Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection [38.809645060899065]
Camera and LiDAR serve as informative sensors for accurate and robust autonomous driving systems.
These sensors are often heterogeneous in nature, resulting in distributional gaps between the modalities.
We introduce a dynamic adjustment technology aimed at aligning modal distributions and learning effective modality representations.
arXiv Detail & Related papers (2024-07-22T02:42:15Z)
- SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation [74.07836010698801]
We propose an SMPL-based Transformer framework (SMPLer) to address this issue.
SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation.
Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods.
arXiv Detail & Related papers (2024-04-23T17:59:59Z)
- FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection [14.457844173630667]
We propose a novel end-to-end multi-modal fusion transformer-based framework, dubbed FusionFormer.
By developing a uniform sampling strategy, our method can easily sample from 2D image and 3D voxel features simultaneously.
Our method achieves state-of-the-art single model performance of 72.6% mAP and 75.1% NDS in the 3D object detection task without test time augmentation.
arXiv Detail & Related papers (2023-09-11T06:27:25Z)
- Learning Modulated Transformation in GANs [69.95217723100413]
We equip the generator in generative adversarial networks (GANs) with a plug-and-play module, termed the modulated transformation module (MTM).
MTM predicts spatial offsets under the control of latent codes, based on which the convolution operation can be applied at variable locations.
It is noteworthy that towards human generation on the challenging TaiChi dataset, we improve the FID of StyleGAN3 from 21.36 to 13.60, demonstrating the efficacy of learning modulated geometry transformation.
arXiv Detail & Related papers (2023-08-29T17:51:22Z)
- Equivariant Multi-Modality Image Fusion [124.11300001864579]
We propose the Equivariant Multi-Modality imAge fusion paradigm for end-to-end self-supervised learning.
Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations.
Experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images.
arXiv Detail & Related papers (2023-05-19T05:50:24Z)
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
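The token-substitution idea summarized above can be sketched as follows. The scoring MLP, the fixed threshold, and the cross-modal projection layer are illustrative assumptions for the sketch, not the paper's actual design details.

```python
# Illustrative sketch of dynamic token substitution: score each token's
# informativeness and, where the score falls below a threshold, swap in the
# projected token from the other modality.
import torch
import torch.nn as nn


class TokenSubstitution(nn.Module):
    def __init__(self, dim: int = 256, threshold: float = 0.3):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.proj_b_to_a = nn.Linear(dim, dim)  # maps modality-B tokens into A's space
        self.threshold = threshold

    def forward(self, tok_a: torch.Tensor, tok_b: torch.Tensor) -> torch.Tensor:
        # tok_a, tok_b: (B, N, dim), assumed aligned token-for-token
        informative = self.score(tok_a) > self.threshold         # (B, N, 1)
        return torch.where(informative, tok_a, self.proj_b_to_a(tok_b))


fuse = TokenSubstitution()
a, b = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
print(fuse(a, b).shape)  # torch.Size([2, 196, 256])
```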
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
- Learning Online Multi-Sensor Depth Fusion [100.84519175539378]
SenFuNet is a depth fusion approach that learns sensor-specific noise and outlier statistics.
We conduct experiments with various sensor combinations on the real-world CoRBS and Scene3D datasets.
arXiv Detail & Related papers (2022-04-07T10:45:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information it provides and is not responsible for any consequences arising from its use.