Self-Supervised Multimodal Fusion Transformer for Passive Activity
Recognition
- URL: http://arxiv.org/abs/2209.03765v1
- Date: Mon, 15 Aug 2022 15:38:10 GMT
- Title: Self-Supervised Multimodal Fusion Transformer for Passive Activity
Recognition
- Authors: Armand K. Koupai, Mohammud J. Bocus, Raul Santos-Rodriguez, Robert J.
Piechocki, Ryan McConville
- Abstract summary: Wi-Fi signals provide significant opportunities for human sensing and activity recognition in fields such as healthcare.
Current systems do not effectively exploit the information acquired through multiple sensors to recognise the different activities.
We propose the Fusion Transformer, an attention-based model for multimodal and multi-sensor fusion.
- Score: 2.35066982314539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The pervasiveness of Wi-Fi signals provides significant opportunities for
human sensing and activity recognition in fields such as healthcare. The
sensors most commonly used for passive Wi-Fi sensing are based on passive Wi-Fi
radar (PWR) and channel state information (CSI) data; however, current systems
do not effectively exploit the information acquired through multiple sensors to
recognise the different activities. In this paper, we explore new properties of
the Transformer architecture for multimodal sensor fusion. We study different
signal processing techniques to extract multiple image-based features from PWR
and CSI data such as spectrograms, scalograms and Markov transition field
(MTF). We first propose the Fusion Transformer, an attention-based model for
multimodal and multi-sensor fusion. Experimental results show that our Fusion
Transformer approach can achieve competitive results compared to a ResNet
architecture but with much fewer resources. To further improve our model, we
propose a simple and effective framework for multimodal and multi-sensor
self-supervised learning (SSL). The self-supervised Fusion Transformer
outperforms the baselines, achieving an F1-score of 95.9%. Finally, we show that
this approach significantly outperforms the others when trained with as little
as 1% (2 minutes) and up to 20% (40 minutes) of the labelled training data.
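For readers unfamiliar with the image-based features named in the abstract, the sketch below shows one common way to derive a spectrogram, a scalogram and a Markov transition field (MTF) from a single 1-D CSI amplitude stream. The sampling rate, wavelet choice, bin count and function names are illustrative assumptions, not the paper's exact signal processing pipeline.

```python
# Hedged sketch: image-like features from one 1-D CSI amplitude stream.
# Parameters and helper names are illustrative, not taken from the paper.
import numpy as np
import pywt                      # PyWavelets, for the continuous wavelet transform
from scipy import signal

def spectrogram_image(x, fs=1000, nperseg=256):
    """Short-time Fourier transform log-power as a 2-D image."""
    _, _, Sxx = signal.spectrogram(x, fs=fs, nperseg=nperseg)
    return 10 * np.log10(Sxx + 1e-12)

def scalogram_image(x, scales=np.arange(1, 65)):
    """Continuous wavelet transform magnitude (Morlet wavelet)."""
    coefs, _ = pywt.cwt(x, scales, "morl")
    return np.abs(coefs)                       # shape: (len(scales), len(x))

def markov_transition_field(x, n_bins=8):
    """Quantile-binned Markov transition field of a 1-D series."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    states = np.digitize(x, edges)             # quantile-bin index per sample
    W = np.zeros((n_bins, n_bins))             # first-order transition counts
    for a, b in zip(states[:-1], states[1:]):
        W[a, b] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)
    # MTF[i, j] = probability of transitioning from the bin of sample i to that of sample j
    return W[states[:, None], states[None, :]]

# Example: three image-based views of one synthetic CSI window.
csi = np.random.randn(2048)
views = [spectrogram_image(csi), scalogram_image(csi), markov_transition_field(csi)]
```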
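The abstract describes the Fusion Transformer only at a high level: attention-based fusion of several image-like modalities coming from multiple sensors. The following PyTorch sketch illustrates one plausible layout consistent with that description, in which each modality is encoded into a single token and a Transformer encoder attends across modality tokens plus a classification token. The per-modality CNN, the one-token-per-modality design and all layer sizes are assumptions for illustration, not the authors' architecture.

```python
# Hedged sketch of an attention-based multimodal fusion classifier in PyTorch.
import torch
import torch.nn as nn

class FusionTransformerSketch(nn.Module):
    def __init__(self, n_modalities=6, d_model=128, n_heads=4, n_layers=4, n_classes=6):
        super().__init__()
        # One lightweight encoder per modality (e.g. spectrogram/scalogram/MTF
        # images from PWR and CSI), each producing a single d_model token.
        self.encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, d_model),
            ) for _ in range(n_modalities)
        ])
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, images):
        # images: list of n_modalities tensors, each of shape (batch, 1, H, W)
        tokens = torch.stack([enc(x) for enc, x in zip(self.encoders, images)], dim=1)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([cls, tokens], dim=1))   # attention across modalities
        return self.head(fused[:, 0])                          # classify from the CLS token

# Example: fuse six modality images (e.g. 3 feature types x 2 sensor types).
model = FusionTransformerSketch()
batch = [torch.randn(4, 1, 64, 64) for _ in range(6)]
logits = model(batch)   # shape: (4, n_classes)
```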
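The abstract does not specify the self-supervised objective, so the sketch below uses a generic cross-modality contrastive (InfoNCE-style) loss between two sensor embeddings purely to illustrate the pre-train-then-fine-tune workflow on a small labelled subset; the paper's actual SSL framework may differ.

```python
# Hedged sketch of a multimodal SSL objective (illustrative assumption only).
import torch
import torch.nn.functional as F

def cross_modal_infonce(z_a, z_b, temperature=0.1):
    """Pull embeddings of the same time window from two modalities together,
    push apart embeddings of different windows within the batch."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example: pre-train encoders on unlabelled windows with this loss, then
# fine-tune a classification head on the small labelled subset (1-20% of data).
z_pwr, z_csi = torch.randn(32, 128), torch.randn(32, 128)
loss = cross_modal_infonce(z_pwr, z_csi)
```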
Related papers
- SeaDATE: Remedy Dual-Attention Transformer with Semantic Alignment via Contrast Learning for Multimodal Object Detection [18.090706979440334]
Multimodal object detection leverages diverse modal information to enhance the accuracy and robustness of detectors.
Current methods merely stack Transformer-guided fusion techniques without exploring their capability to extract features at different depths of the network.
In this paper, we introduce an accurate and efficient object detection method named SeaDATE.
arXiv Detail & Related papers (2024-10-15T07:26:39Z) - Rethinking Transformers Pre-training for Multi-Spectral Satellite
Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z) - Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z) - Multi-scale Transformer-based Network for Emotion Recognition from Multi
Physiological Signals [11.479653866646762]
This paper presents an efficient multi-scale Transformer-based approach for emotion recognition from physiological data.
Our approach applies a multimodal technique combined with data scaling to establish the relationship between internal body signals and human emotions.
Our model achieves decent results on the CASE dataset of the EPiC competition, with an RMSE score of 1.45.
arXiv Detail & Related papers (2023-05-01T11:10:48Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - Robust Multimodal Fusion for Human Activity Recognition [5.858726030608716]
We propose Centaur, a multimodal fusion model for human activity recognition (HAR) that is robust to data quality issues.
Centaur's data cleaning module outperforms two state-of-the-art autoencoder-based models, and its multimodal fusion module outperforms four strong baselines.
Compared to two related robust fusion architectures, Centaur is more robust, achieving 11.59-17.52% higher accuracy in HAR.
arXiv Detail & Related papers (2023-03-08T14:56:11Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - UMSNet: An Universal Multi-sensor Network for Human Activity Recognition [10.952666953066542]
This paper proposes a universal multi-sensor network (UMSNet) for human activity recognition.
In particular, we propose a new lightweight sensor residual block (called the LSR block), which improves performance.
Our framework has a clear structure and can be directly applied to various types of multi-modal Time Series Classification tasks.
arXiv Detail & Related papers (2022-05-24T03:29:54Z) - Robust Semi-supervised Federated Learning for Images Automatic
Recognition in Internet of Drones [57.468730437381076]
We present a Semi-supervised Federated Learning (SSFL) framework for privacy-preserving UAV image recognition.
There are significant differences in the number, features, and distribution of local data collected by UAVs using different camera modules.
We propose an aggregation rule based on the frequency of the client's participation in training, namely the FedFreq aggregation rule.
arXiv Detail & Related papers (2022-01-03T16:49:33Z) - Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs can provide the combined information, making object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper.
arXiv Detail & Related papers (2021-10-30T15:34:12Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)