MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels
- URL: http://arxiv.org/abs/2404.01958v1
- Date: Tue, 2 Apr 2024 13:54:05 GMT
- Title: MESEN: Exploit Multimodal Data to Design Unimodal Human Activity Recognition with Few Labels
- Authors: Lilin Xu, Chaojie Gu, Rui Tan, Shibo He, Jiming Chen
- Abstract summary: MESEN is a multimodal-empowered unimodal sensing framework.
MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality.
MESEN achieves significant performance improvements over state-of-the-art baselines.
- Score: 11.853566358505434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human activity recognition (HAR) will be an essential function of various emerging applications. However, HAR typically encounters challenges related to modality limitations and label scarcity, leading to an application gap between current solutions and real-world requirements. In this work, we propose MESEN, a multimodal-empowered unimodal sensing framework, to utilize unlabeled multimodal data available during the HAR model design phase for unimodal HAR enhancement during the deployment phase. From a study on the impact of supervised multimodal fusion on unimodal feature extraction, MESEN is designed to feature a multi-task mechanism during the multimodal-aided pre-training stage. With the proposed mechanism integrating cross-modal feature contrastive learning and multimodal pseudo-classification aligning, MESEN exploits unlabeled multimodal data to extract effective unimodal features for each modality. Subsequently, MESEN can adapt to downstream unimodal HAR with only a few labeled samples. Extensive experiments on eight public multimodal datasets demonstrate that MESEN achieves significant performance improvements over state-of-the-art baselines in enhancing unimodal HAR by exploiting multimodal data.
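The abstract describes the pre-training mechanism only at a high level. Below is a minimal sketch, assuming a PyTorch-style setup, of what a multi-task objective combining cross-modal feature contrastive learning with pseudo-classification alignment could look like. The encoder and head architectures, the use of a fused head's predictions as pseudo-labels, the equal loss weighting, and all identifiers (`UnimodalEncoder`, `cross_modal_contrastive`, `pseudo_classification_alignment`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of MESEN-style multi-task pre-training.
# All names and loss details are assumptions based on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnimodalEncoder(nn.Module):
    """Toy per-modality encoder: maps a flattened sensor window to a normalized embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def cross_modal_contrastive(z_a, z_b, temperature: float = 0.1):
    """InfoNCE-style loss: the i-th sample's features from two modalities are positives."""
    logits = z_a @ z_b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)  # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def pseudo_classification_alignment(logits_per_modality, fused_logits):
    """Align each modality's pseudo-class prediction with the fused (multimodal) one."""
    target = F.softmax(fused_logits.detach(), dim=-1)       # pseudo-labels from the fused view
    return sum(F.kl_div(F.log_softmax(l, dim=-1), target, reduction="batchmean")
               for l in logits_per_modality) / len(logits_per_modality)

# --- toy usage on unlabeled paired data (e.g., accelerometer + gyroscope windows) ---
B, D_ACC, D_GYR, N_PSEUDO = 32, 300, 300, 16                # N_PSEUDO pseudo-classes (assumed)
enc_acc, enc_gyr = UnimodalEncoder(D_ACC), UnimodalEncoder(D_GYR)
head_acc, head_gyr = nn.Linear(128, N_PSEUDO), nn.Linear(128, N_PSEUDO)
head_fused = nn.Linear(256, N_PSEUDO)

x_acc, x_gyr = torch.randn(B, D_ACC), torch.randn(B, D_GYR)
z_acc, z_gyr = enc_acc(x_acc), enc_gyr(x_gyr)

loss = cross_modal_contrastive(z_acc, z_gyr) + pseudo_classification_alignment(
    [head_acc(z_acc), head_gyr(z_gyr)], head_fused(torch.cat([z_acc, z_gyr], dim=-1)))
loss.backward()  # one pre-training step on unlabeled multimodal data
```

After pre-training on unlabeled paired data, a single encoder (e.g., the accelerometer branch) would be kept and fine-tuned on the few labeled unimodal samples, matching the deployment scenario the abstract describes.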
Related papers
- Alt-MoE: Multimodal Alignment via Alternating Optimization of Multi-directional MoE with Unimodal Models [7.134682404460003]
We introduce a novel training framework, Alt-MoE, which employs the Mixture of Experts (MoE) as a unified multi-directional connector across modalities.
Our methodology has been validated on several well-performing uni-modal models.
arXiv Detail & Related papers (2024-09-09T10:40:50Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
- Self-MI: Efficient Multimodal Fusion via Self-Supervised Multi-Task Learning with Auxiliary Mutual Information Maximization [2.4660652494309936]
Multimodal representation learning poses significant challenges.
Existing methods often struggle to exploit the unique characteristics of each modality.
In this study, we propose Self-MI in a self-supervised learning fashion.
arXiv Detail & Related papers (2023-11-07T08:10:36Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Distilled Mid-Fusion Transformer Networks for Multi-Modal Human Activity Recognition [34.424960016807795]
Multi-modal Human Activity Recognition can utilize complementary information across modalities to build models that generalize well.
While deep learning methods have shown promising results, their potential for extracting salient multi-modal spatial-temporal features has not been fully explored.
A knowledge distillation-based Multi-modal Mid-Fusion approach, DMFT, is proposed to conduct informative feature extraction and fusion to resolve the Multi-modal Human Activity Recognition task efficiently.
arXiv Detail & Related papers (2023-05-05T19:26:06Z)
- SHAPE: An Unified Approach to Evaluate the Contribution and Cooperation of Individual Modalities [7.9602600629569285]
We use SHapley vAlue-based PErceptual (SHAPE) scores to measure the marginal contribution of individual modalities and the degree of cooperation across modalities.
Our experiments suggest that for some tasks where different modalities are complementary, the multi-modal models still tend to use the dominant modality alone.
We hope our scores can help improve the understanding of how the present multi-modal models operate on different modalities and encourage more sophisticated methods of integrating multiple modalities.
arXiv Detail & Related papers (2022-04-30T16:35:40Z)
- How to Sense the World: Leveraging Hierarchy in Multimodal Perception for Robust Reinforcement Learning Agents [9.840104333194663]
We argue for hierarchy in the design of representation models and contribute a novel multimodal representation model, MUSE.
MUSE serves as the sensory representation model for deep reinforcement learning agents provided with multimodal observations in Atari games.
We perform a comparative study over different designs of reinforcement learning agents, showing that MUSE allows agents to perform tasks under incomplete perceptual experience with minimal performance loss.
arXiv Detail & Related papers (2021-10-07T16:35:23Z)