MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition
- URL: http://arxiv.org/abs/2504.02279v2
- Date: Mon, 07 Apr 2025 11:53:15 GMT
- Title: MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition
- Authors: Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide
- Abstract summary: Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. We propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer to dynamically model inter-view relationships and capture temporal dependencies across multiple views.
- Score: 2.7745600113170994
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. However, existing methods often fall short of addressing real-world challenges such as diverse environmental conditions, strict sensor synchronization, and the need for fine-grained annotations. In this study, we propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer to dynamically model inter-view relationships and capture temporal dependencies across multiple views. Additionally, we introduce a Human Detection Module to generate pseudo-ground-truth labels, enabling the model to prioritize frames containing human activity and enhance spatial feature learning. Comprehensive experiments conducted on our in-house MultiSensor-Home dataset and the existing MM-Office dataset demonstrate that MultiTSF outperforms state-of-the-art methods in both video sequence-level and frame-level action recognition settings.
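The idea of modeling inter-view relationships with attention can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `fuse_views`, the use of identity query/key/value projections, the single attention head, and the final mean pooling are all assumptions made for illustration. A trained model would learn its projections and would also model the temporal dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features):
    """Fuse per-view feature vectors for one time step (hypothetical helper).

    Each view attends to every view with scaled dot-product attention.
    Identity projections are assumed for queries, keys, and values.
    Returns the mean-pooled fused vector and the attention weight matrix.
    """
    dim = len(view_features[0])
    scale = math.sqrt(dim)
    num_views = len(view_features)

    weights = []
    attended = []
    for query in view_features:
        # Similarity of this view to every view, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in view_features
        ]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of all view vectors for this query view.
        attended.append([
            sum(row[j] * view_features[j][d] for j in range(num_views))
            for d in range(dim)
        ])

    # Mean-pool the attended view tokens into a single fused vector.
    fused = [sum(vec[d] for vec in attended) / num_views for d in range(dim)]
    return fused, weights
```

For example, `fuse_views([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])` returns one fused two-dimensional vector together with a 3x3 weight matrix whose rows each sum to 1, showing how strongly each view attends to the others.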
Related papers
- Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision [25.31489336119893]
We systematically review the applications of multimodal fusion in key robotic vision tasks. We compare vision-language models (VLMs) with traditional multimodal fusion methods, analyzing their advantages, limitations, and synergies. We identify critical research challenges such as cross-modal alignment, efficient fusion strategies, real-time deployment, and domain adaptation.
arXiv Detail & Related papers (2025-04-03T10:53:07Z)
- MultiSensor-Home: A Wide-area Multi-modal Multi-view Dataset for Action Recognition and Transformer-based Sensor Fusion [2.7745600113170994]
Multi-modal multi-view action recognition is a rapidly growing field in computer vision. Current datasets often fail to address real-world challenges such as wide-area environmental conditions, asynchronous data streams, and the lack of frame-level annotations. We propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF) method and introduce the MultiSensor-Home dataset.
arXiv Detail & Related papers (2025-04-03T05:23:08Z)
- Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect. We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. In experiments, we employ it in unsupervised multi-view clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval.
arXiv Detail & Related papers (2025-03-06T07:01:08Z)
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- Reliable Object Tracking by Multimodal Hybrid Feature Extraction and Transformer-Based Fusion [18.138433117711177]
We propose a novel multimodal hybrid tracker (MMHT) that utilizes frame-event-based data for reliable single object tracking.
The MMHT model employs a hybrid backbone consisting of an artificial neural network (ANN) and a spiking neural network (SNN) to extract dominant features from different visual modalities.
Extensive experiments demonstrate that the MMHT model exhibits competitive performance in comparison with other state-of-the-art methods.
arXiv Detail & Related papers (2024-05-28T07:24:56Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images [1.662438436885552]
Multi-modal fusion has been shown to improve accuracy by combining data from multiple modalities.
We propose a novel multi-modal fusion strategy for mapping relationships between different channels at the early stage.
By addressing fusion in the early stage, as opposed to mid or late-stage methods, our method achieves competitive and even superior performance compared to existing techniques.
arXiv Detail & Related papers (2023-10-21T00:56:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition [45.0131792009999]
We propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition.
Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information.
Our network outperforms state-of-the-art approaches in most standard evaluation settings.
arXiv Detail & Related papers (2023-07-22T03:51:32Z)
- Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity Recognition [34.424960016807795]
Multi-modal Human Activity Recognition can use complementary information across modalities to build models that generalize well.
Although deep learning methods have shown promising results, their potential in extracting salient multi-modal spatial-temporal features has not been fully explored.
A knowledge distillation-based Multi-modal Mid-Fusion approach, DMFT, is proposed to conduct informative feature extraction and fusion to resolve the Multi-modal Human Activity Recognition task efficiently.
arXiv Detail & Related papers (2023-05-05T19:26:06Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Multi-modal Fusion for Single-Stage Continuous Gesture Recognition [45.19890687786009]
We introduce a single-stage continuous gesture recognition framework, called Temporal Multi-Modal Fusion (TMMF).
TMMF can detect and classify multiple gestures in a video via a single model.
This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation step.
arXiv Detail & Related papers (2020-11-10T07:09:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.