A Multimodal Sensor Fusion Framework Robust to Missing Modalities for
Person Recognition
- URL: http://arxiv.org/abs/2210.10972v2
- Date: Sat, 22 Oct 2022 04:51:51 GMT
- Title: A Multimodal Sensor Fusion Framework Robust to Missing Modalities for
Person Recognition
- Authors: Vijay John and Yasutomo Kawanishi
- Abstract summary: We propose a novel trimodal sensor fusion framework using the audio, visible, and thermal camera.
A novel deep latent embedding network, termed AVTNet, is proposed to learn multiple latent embeddings.
A comparative analysis with baseline algorithms shows that the proposed framework significantly increases the person recognition accuracy.
- Score: 2.436681150766912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Utilizing the sensor characteristics of the audio, visible camera, and
thermal camera, the robustness of person recognition can be enhanced. Existing
multimodal person recognition frameworks are primarily formulated assuming that
multimodal data is always available. In this paper, we propose a novel trimodal
sensor fusion framework using the audio, visible, and thermal camera, which
addresses the missing modality problem. Within this framework, a novel deep
latent embedding network, termed AVTNet, is proposed to learn multiple latent
embeddings. Also, a novel loss function, termed the missing modality loss, accounts
for possible missing modalities based on the triplet loss calculation while
learning the individual latent embeddings. Additionally, a joint latent
embedding utilizing the trimodal data is learnt using the multi-head attention
transformer, which assigns attention weights to the different modalities. The
different latent embeddings are subsequently used to train a deep neural
network. The proposed framework is validated on the Speaking Faces dataset. A
comparative analysis with baseline algorithms shows that the proposed framework
significantly increases the person recognition accuracy while accounting for
missing modalities.
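Based only on the abstract above, the following is a minimal, hypothetical PyTorch sketch of the two mechanisms it describes: fusing the three per-modality embeddings into a joint embedding with a multi-head attention transformer, and a triplet-based loss that is evaluated only for modalities present in a given sample. All class and function names, dimensions, and masking conventions are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch (not the authors' code) of attention-based trimodal
# fusion and a missing-modality-aware triplet loss, based on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TrimodalAttentionFusion(nn.Module):
    """Fuses audio/visible/thermal embeddings into a joint embedding."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visible, thermal, present):
        # Treat the three modality embeddings as a 3-token sequence: (B, 3, D).
        tokens = torch.stack([audio, visible, thermal], dim=1)
        # present: (B, 3) bool, True where the modality was observed;
        # key_padding_mask expects True for positions to ignore.
        fused, _ = self.attn(tokens, tokens, tokens, key_padding_mask=~present)
        # Average the attended tokens over the present modalities only.
        keep = present.unsqueeze(-1).float()
        return (fused * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)


def missing_modality_triplet_loss(anchor, positive, negative, present, margin=1.0):
    """Per-modality triplet loss, skipping samples where that modality is missing.

    anchor/positive/negative: dicts mapping modality name -> (B, D) embeddings.
    present: dict mapping modality name -> (B,) bool mask of observed samples.
    """
    losses = []
    for m in anchor:
        per_sample = F.triplet_margin_loss(
            anchor[m], positive[m], negative[m],
            margin=margin, reduction="none")  # (B,)
        mask = present[m].float()
        losses.append((per_sample * mask).sum() / mask.sum().clamp(min=1.0))
    return torch.stack(losses).mean()
```

In the described framework, the per-modality embeddings and the attention-fused joint embedding would then feed the downstream person-recognition network; the masking shown here is only one plausible way to realise the "missing modality loss" named in the abstract.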
Related papers
- GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving [9.023864430027333]
Multimodal place recognition has gained increasing attention due to its ability to overcome the weaknesses of single-sensor systems.
We propose a 3D Gaussian-based multimodal place recognition neural network dubbed GSPR.
arXiv Detail & Related papers (2024-10-01T00:43:45Z)
- Robust Multimodal 3D Object Detection via Modality-Agnostic Decoding and Proximity-based Modality Ensemble [15.173314907900842]
Existing 3D object detection methods rely heavily on the LiDAR sensor.
We propose MEFormer to address the LiDAR over-reliance problem.
Our MEFormer achieves state-of-the-art performance of 73.9% NDS and 71.5% mAP on the nuScenes validation set.
arXiv Detail & Related papers (2024-07-27T03:21:44Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Fully Differentiable Correlation-driven 2D/3D Registration for X-ray to CT Image Fusion [3.868072865207522]
Image-based rigid 2D/3D registration is a critical technique for fluoroscopy-guided surgical interventions.
We propose a novel fully differentiable correlation-driven network using a dual-branch CNN-transformer encoder.
A correlation-driven loss is proposed for low-frequency feature and high-frequency feature decomposition based on embedded information.
arXiv Detail & Related papers (2024-02-04T14:12:51Z)
- Multi-scale Semantic Correlation Mining for Visible-Infrared Person Re-Identification [19.49945790485511]
MSCMNet is proposed to comprehensively exploit semantic features at multiple scales.
It simultaneously keeps modality information loss in feature extraction as small as possible.
Extensive experiments on the SYSU-MM01, RegDB, and LLCM datasets demonstrate that the proposed MSCMNet achieves the highest accuracy.
arXiv Detail & Related papers (2023-11-24T10:23:57Z)
- mmFUSION: Multimodal Fusion for 3D Objects Detection [18.401155770778757]
Multi-sensor fusion is essential for accurate 3D object detection in self-driving systems.
In this paper, we propose a new intermediate-level multi-modal fusion approach to overcome these challenges.
The code with the mmdetection3D project plugin will be publicly available soon.
arXiv Detail & Related papers (2023-11-07T15:11:27Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing Things [82.15959827765325]
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL).
We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability.
Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time.
arXiv Detail & Related papers (2022-07-14T10:04:18Z)
- ReDFeat: Recoupling Detection and Description for Multimodal Feature Learning [51.07496081296863]
We recouple independent constraints of detection and description of multimodal feature learning with a mutual weighting strategy.
We propose a detector that possesses a large receptive field and is equipped with learnable non-maximum suppression layers.
We build a benchmark that contains cross visible, infrared, near-infrared and synthetic aperture radar image pairs for evaluating the performance of features in feature matching and image registration tasks.
arXiv Detail & Related papers (2022-05-16T04:24:22Z)
- Exploring Data Augmentation for Multi-Modality 3D Object Detection [82.9988604088494]
It is counter-intuitive that multi-modality methods based on point clouds and images perform only marginally better, or sometimes worse, than approaches that solely use point clouds.
We propose a pipeline, named transformation flow, to bridge the gap between single and multi-modality data augmentation with transformation reversing and replaying.
Our method also wins the best PKL award in the 3rd nuScenes detection challenge.
arXiv Detail & Related papers (2020-12-23T15:23:16Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)