Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity
Recognition
- URL: http://arxiv.org/abs/2211.04331v1
- Date: Tue, 8 Nov 2022 15:48:44 GMT
- Title: Multi-Stage Based Feature Fusion of Multi-Modal Data for Human Activity
Recognition
- Authors: Hyeongju Choi, Apoorva Beedu, Harish Haresamudram, Irfan Essa
- Abstract summary: We propose a multi-modal framework that learns to effectively combine features from RGB Video and IMU sensors.
Our model is trained in two stages: in the first stage, each input encoder learns to extract features effectively.
We show significant improvements of 22% and 11% over the video-only and IMU-only setups on the UTD-MHAD dataset, and of 20% and 12% on the MMAct dataset.
- Score: 6.0306313759213275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To properly assist humans in their needs, human activity recognition (HAR)
systems need the ability to fuse information from multiple modalities. Our
hypothesis is that multimodal sensors, visual and non-visual, tend to provide
complementary information, addressing the limitations of other modalities. In
this work, we propose a multi-modal framework that learns to effectively
combine features from RGB Video and IMU sensors, and show its robustness on
the MMAct and UTD-MHAD datasets. Our model is trained in two stages: in the
first stage, each input encoder learns to extract features effectively, and in
the second stage, the model learns to combine these individual features. We show
significant improvements of 22% and 11% over the video-only and IMU-only
setups on the UTD-MHAD dataset, and of 20% and 12% on the MMAct dataset. Through extensive
experimentation, we show the robustness of our model in the zero-shot and
limited-annotated-data settings. We further compare with state-of-the-art
methods that use more input modalities and show that our method outperforms
them significantly on the more difficult MMAct dataset, and performs comparably on the
UTD-MHAD dataset.
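To make the two-stage recipe concrete, below is a minimal PyTorch-style sketch of the idea: per-modality encoders for RGB video and IMU streams are trained first, and a fusion head then learns to classify activities from their concatenated features. The encoder architectures, feature dimensions, and fusion design are illustrative assumptions, not the authors' exact implementation; only the output size (27 classes) matches UTD-MHAD.

import torch
import torch.nn as nn

class IMUEncoder(nn.Module):
    # 1D-CNN over raw inertial streams (e.g., 3-axis accelerometer + gyroscope).
    def __init__(self, in_channels=6, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(128, feat_dim))

    def forward(self, x):  # x: (batch, channels, time)
        return self.net(x)

class VideoEncoder(nn.Module):
    # Tiny 3D-CNN stand-in for an RGB video backbone (assumption).
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(32, feat_dim))

    def forward(self, x):  # x: (batch, 3, frames, height, width)
        return self.net(x)

class FusionHead(nn.Module):
    # Stage 2: classify activities from the concatenated per-modality features.
    def __init__(self, feat_dim=256, num_classes=27):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, f_video, f_imu):
        return self.classifier(torch.cat([f_video, f_imu], dim=-1))

# Stage 1 (sketch): train each encoder with its own classifier on unimodal data.
# Stage 2 (sketch): reuse the trained encoders and optimise the fusion head
# (or fine-tune end-to-end) with a standard cross-entropy objective.
video_enc, imu_enc, head = VideoEncoder(), IMUEncoder(), FusionHead()
logits = head(video_enc(torch.randn(2, 3, 16, 112, 112)),
              imu_enc(torch.randn(2, 6, 128)))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))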
Related papers
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance on nine tasks while using significantly less data than other state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
- MuJo: Multimodal Joint Feature Space Learning for Human Activity Recognition [2.7532797256542403]
Human Activity Recognition (HAR) is a longstanding problem in AI with applications in a broad range of areas, including healthcare, sports and fitness, security, and more.
We introduce our comprehensive Fitness Multimodal Activity dataset (FiMAD) to enhance HAR performance across various modalities.
We show that classifiers pre-trained on FiMAD can increase the performance on real HAR datasets such as MM-Fit, MyoGym, MotionSense, and MHEALTH.
arXiv Detail & Related papers (2024-06-06T08:42:36Z)
- Enhancing Inertial Hand based HAR through Joint Representation of Language, Pose and Synthetic IMUs [9.570759294459629]
We propose Multi$3$Net, a novel multi-modal, multitask, and contrastive-based framework that addresses the issue of limited data.
Our method seeks to enhance wearable HAR performance, especially in recognizing subtle activities.
arXiv Detail & Related papers (2024-06-03T13:28:42Z)
- AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection [23.91870504363899]
Double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data, which has hindered the widespread deployment of multispectral pedestrian detection on embedded devices for autonomous systems.
We introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network.
arXiv Detail & Related papers (2024-05-21T17:17:17Z)
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized with multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance compared with both full fine-tuning methods and prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, combining the abilities of ChatGPT and text-to-image models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Progressive Cross-modal Knowledge Distillation for Human Action Recognition [10.269019492921306]
We propose a novel Progressive Skeleton-to-sensor Knowledge Distillation (PSKD) model for solving the wearable sensor-based HAR problem.
Specifically, we construct multiple teacher models using data from both the teacher (human skeleton sequence) and student (time-series accelerometer data) modalities; a generic distillation-loss sketch in this spirit appears after this list.
arXiv Detail & Related papers (2022-08-17T06:06:03Z)
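As a companion to the cross-modal distillation entries above (AMFD and PSKD), here is a hedged sketch of a standard soft-target distillation loss with which a skeleton-based teacher could supervise an inertial student. The temperature, loss weighting, and 27-class setup are illustrative assumptions, not details taken from either paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    # Hard-target term: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random logits for a 27-class HAR problem (hypothetical sizes).
s = torch.randn(8, 27)        # student (accelerometer) predictions
t = torch.randn(8, 27)        # teacher (skeleton) predictions
y = torch.randint(0, 27, (8,))
loss = distillation_loss(s, t.detach(), y)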
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.