Advancing Multi-Modal Sensing Through Expandable Modality Alignment
- URL: http://arxiv.org/abs/2407.17777v1
- Date: Thu, 25 Jul 2024 05:10:48 GMT
- Title: Advancing Multi-Modal Sensing Through Expandable Modality Alignment
- Authors: Shenghong Dai, Shiqi Jiang, Yifan Yang, Ting Cao, Mo Li, Suman Banerjee, Lili Qiu
- Abstract summary: We introduce the Babel framework, encompassing the neural network architecture, data preparation and processing, as well as the training strategies.
Babel serves as a scalable pre-trained multi-modal sensing neural network, currently aligning six sensing modalities.
Babel not only effectively fuses multiple available modalities (up to 22% accuracy increase), but also enhances the performance of individual modalities.
- Score: 14.0873117319398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sensing technology is widely used for comprehending the physical world, with numerous modalities explored in past decades. While there has been considerable work on multi-modality learning, these approaches all require data from every modality to be paired. How to leverage multi-modality data with only partial pairings remains an open problem. To tackle this challenge, we introduce the Babel framework, encompassing the neural network architecture, data preparation and processing, as well as the training strategies. Babel serves as a scalable pre-trained multi-modal sensing neural network, currently aligning six sensing modalities, namely Wi-Fi, mmWave, IMU, LiDAR, video, and depth. To overcome the scarcity of complete paired data, the key idea of Babel is to transform the N-modality alignment into a series of two-modality alignments by devising an expandable network architecture. This concept is further realized via a series of novel techniques, including pre-trained modality towers that capitalize on available single-modal networks, and an adaptive training strategy that balances the contribution of the newly incorporated modality with the previously established modality alignment. Evaluation demonstrates Babel's outstanding performance on eight human activity recognition datasets compared to various baselines, e.g., the top multi-modal sensing framework, single-modal sensing networks, and multi-modal large language models. Babel not only effectively fuses multiple available modalities (up to 22% accuracy increase), but also enhances the performance of individual modalities (12% average accuracy improvement). Case studies also highlight exciting application scenarios empowered by Babel, including cross-modality retrieval (i.e., sensing imaging) and bridging LLMs for sensing comprehension.
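The expandable, pairwise alignment idea described in the abstract can be pictured with a short sketch. The code below is not the authors' implementation; it is a minimal PyTorch illustration assuming each modality ships with a pre-trained encoder (a "modality tower") and that alignment between one new modality and one already-aligned anchor modality is learned with a symmetric InfoNCE-style contrastive loss. The loss choice, the projection head, and the omission of Babel's adaptive weighting between new and established alignments are all assumptions of this sketch.

```python
# Minimal sketch of expandable pairwise modality alignment (illustrative assumptions,
# not the Babel release): each pre-trained single-modal encoder is wrapped in a tower
# with a small trainable projection, and a new tower is aligned against one anchor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTower(nn.Module):
    """Wraps a pre-trained single-modal encoder with a trainable projection head."""
    def __init__(self, encoder: nn.Module, feat_dim: int, embed_dim: int = 256, freeze: bool = True):
        super().__init__()
        self.encoder = encoder
        if freeze:
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, x):
        return F.normalize(self.proj(self.encoder(x)), dim=-1)

def pairwise_alignment_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of embeddings paired across two modalities."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Expanding the aligned set one modality at a time: a newly added tower only needs
# data paired with one already-aligned anchor modality, not with all of them.
# for x_new, x_anchor in paired_loader:   # two-modality pairings suffice
#     loss = pairwise_alignment_loss(new_tower(x_new), anchor_tower(x_anchor))
#     loss.backward()
```

Because each new tower is aligned against a single anchor, only two-modality paired data is needed at every expansion step, which is how the framework sidesteps the scarcity of fully paired N-modality datasets.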
Related papers
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.
MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z)
- Noise-powered Multi-modal Knowledge Graph Representation Framework [52.95468915728721]
The rise of Multi-modal Pre-training highlights the necessity for a unified Multi-Modal Knowledge Graph representation learning framework.
We propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking.
Our approach achieves SOTA performance across a total of ten datasets, demonstrating its versatility.
arXiv Detail & Related papers (2024-03-11T15:48:43Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- AllSpark: A Multimodal Spatio-Temporal General Intelligence Model with Ten Modalities via Language as a Reference Framework [21.10693332367192]
We present AllSpark, a multimodal spatio-temporal general artificial intelligence model.
Our model integrates ten different modalities into a unified framework.
Experiments indicate that the incorporation of language enables AllSpark to excel in few-shot classification tasks.
arXiv Detail & Related papers (2023-12-31T17:21:02Z)
- Semi-supervised Multimodal Representation Learning through a Global Workspace [2.8948274245812335]
"Global Workspace" is a shared representation for two input modalities.
This architecture is amenable to self-supervised training via cycle-consistency.
We show that such an architecture can be trained to align and translate between two modalities with very little need for matched data.
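As a rough illustration of how cycle-consistency can stand in for matched pairs, the sketch below (an assumption-laden simplification, not the paper's model) encodes one modality into a shared workspace, translates it to the other modality, re-encodes the translation, and penalizes the distance between the two workspace codes; the module names and the MSE penalty are illustrative choices.

```python
# Illustrative "global workspace" cycle-consistency sketch (not the paper's code):
# two modalities share a latent workspace, and unpaired data can still train the
# system because encode -> translate -> re-encode should return to the same point.
import torch.nn as nn
import torch.nn.functional as F

class GlobalWorkspace(nn.Module):
    def __init__(self, enc_a: nn.Module, dec_a: nn.Module, enc_b: nn.Module, dec_b: nn.Module):
        super().__init__()
        self.enc_a, self.dec_a = enc_a, dec_a  # modality A <-> workspace
        self.enc_b, self.dec_b = enc_b, dec_b  # modality B <-> workspace

    def cycle_loss_a(self, x_a):
        z = self.enc_a(x_a)           # A -> workspace
        x_b_hat = self.dec_b(z)       # workspace -> B (translation, no paired B required)
        z_back = self.enc_b(x_b_hat)  # translated B -> workspace again
        return F.mse_loss(z_back, z)  # the cycle should land back on the same code
```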
arXiv Detail & Related papers (2023-06-27T12:41:36Z)
- LANISTR: Multimodal Learning from Structured and Unstructured Data [33.73687295669768]
LANISTR is an attention-based framework to learn from LANguage, Image, and STRuctured data.
In particular, we introduce a new similarity-based multimodal masking loss that enables it to learn cross-modal relations from large-scale multimodal data with missing modalities.
arXiv Detail & Related papers (2023-05-26T00:50:09Z)
- Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation [27.23513712371972]
We propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion.
We also propose M3L: Multi-modal Teacher for Masked Modality Learning.
Our proposal shows an absolute improvement of up to 10% on robust mIoU above the most competitive baselines.
arXiv Detail & Related papers (2023-04-21T05:52:50Z)
- Unimodal Training-Multimodal Prediction: Cross-modal Federated Learning with Hierarchical Aggregation [16.308470947384134]
HA-Fedformer is a novel transformer-based model that empowers unimodal training with only a unimodal dataset at the client.
We develop an uncertainty-aware aggregation method for the local encoders with layer-wise Markov Chain Monte Carlo sampling.
Our experiments on popular sentiment analysis benchmarks, CMU-MOSI and CMU-MOSEI, demonstrate that HA-Fedformer significantly outperforms state-of-the-art multimodal models.
arXiv Detail & Related papers (2023-03-27T07:07:33Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We direct effort toward efficient adaptation of existing models and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
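To make the "one linear projection plus one trainable token" recipe concrete, here is a hedged sketch of that parameter-efficiency pattern; the class name, tensor shapes, and the way the prefix is concatenated are assumptions of this illustration rather than the released eP-ALM code.

```python
# Sketch of the parameter-efficient pattern described above (assumed interface):
# the language model and the perceptual encoder stay frozen; only a linear
# projection of the perceptual features and a single prepended soft token train.
import torch
import torch.nn as nn

class PerceptualPrefix(nn.Module):
    """Trainable pieces only: one linear projection and one soft token."""
    def __init__(self, vis_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)                      # trained
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))   # trained

    def forward(self, vis_feats: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # vis_feats: (B, vis_dim) from a frozen perceptual encoder
        # token_embeds: (B, T, lm_dim) from the frozen language model's embedding layer
        prefix = self.proj(vis_feats).unsqueeze(1)                   # (B, 1, lm_dim)
        soft = self.soft_token.expand(token_embeds.size(0), -1, -1)  # (B, 1, lm_dim)
        return torch.cat([soft, prefix, token_embeds], dim=1)        # input to the frozen LM
```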
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
- Routing with Self-Attention for Multimodal Capsule Networks [108.85007719132618]
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only for robust training with noisy video data, but also to scale up the size of the capsule network compared to traditional routing methods.
arXiv Detail & Related papers (2021-12-01T19:01:26Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.