S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture
- URL: http://arxiv.org/abs/2307.00226v1
- Date: Sat, 1 Jul 2023 05:02:46 GMT
- Title: S-Omninet: Structured Data Enhanced Universal Multimodal Learning Architecture
- Authors: Ye Xue, Diego Klabjan, Jean Utke
- Abstract summary: Multimodal multitask learning has attracted an increasing interest in recent years.
Many methods have been proposed to learn from a specific type of multimodal data, such as vision and language data.
We extend and improve Omninet, an architecture that is capable of handling multiple modalities and tasks at a time.
- Score: 19.927662512903915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal multitask learning has attracted increasing interest in recent years. Single-modal models have been advancing rapidly and have achieved astonishing results on various tasks across multiple domains. Multimodal learning offers opportunities for further improvements by integrating data from multiple modalities. Many methods have been proposed to learn from a specific type of multimodal data, such as vision and language data; only a few are designed to handle several modalities and tasks at a time. In this work, we extend and improve Omninet, an architecture capable of handling multiple modalities and tasks simultaneously, by introducing cross-cache attention, integrating patch embeddings for vision inputs, and supporting structured data. The proposed Structured-data-enhanced Omninet (S-Omninet) is a universal model that learns effectively from structured data of various dimensions together with unstructured data through cross-cache attention, which enables interactions among spatial, temporal, and structured features. We also enhance spatial representations in the spatial cache with patch embeddings. We evaluate the proposed model on several multimodal datasets and demonstrate a significant improvement over the baseline, Omninet.
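The sketch below is a minimal, illustrative PyTorch rendering of the two ideas highlighted in the abstract: a patch embedding that turns an image into a sequence of spatial tokens, and a cross-cache attention block in which one cache (here, the spatial one) attends over the concatenation of the other caches (temporal and structured). All class names, dimensions, and hyperparameters are hypothetical assumptions for illustration and are not taken from the paper's implementation.

```python
# Illustrative sketch only: not the authors' code. Assumes each cache is a
# batch of sequences of d_model-dimensional embeddings.
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each to d_model."""
    def __init__(self, d_model=512, patch_size=16, in_channels=3):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, d_model,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, images):                      # (B, C, H, W)
        patches = self.proj(images)                 # (B, d_model, H/ps, W/ps)
        return patches.flatten(2).transpose(1, 2)   # (B, num_patches, d_model)


class CrossCacheAttention(nn.Module):
    """Let one cache attend over the concatenation of the other caches."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_cache, other_caches):
        context = torch.cat(other_caches, dim=1)     # (B, total_len, d_model)
        attended, _ = self.attn(query_cache, context, context)
        return self.norm(query_cache + attended)     # residual + layer norm


# Toy usage: spatial cache from image patches, plus temporal and structured caches.
B, d = 2, 512
spatial = PatchEmbedding(d)(torch.randn(B, 3, 224, 224))   # (B, 196, 512)
temporal = torch.randn(B, 30, d)                            # e.g. frame/token features
structured = torch.randn(B, 12, d)                          # embedded tabular fields
fused_spatial = CrossCacheAttention(d)(spatial, [temporal, structured])
print(fused_spatial.shape)   # torch.Size([2, 196, 512])
```

In this reading, cross-cache attention is simply cross-attention whose keys and values come from the other caches, which is one plausible way to let spatial, temporal, and structured features interact; the paper's actual block structure may differ.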
Related papers
- Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning [23.035725779568587]
We study the role and interactions of multiple modalities in mitigating forgetting in deep neural networks (DNNs).
Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations.
We propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality.
arXiv Detail & Related papers (2024-05-04T22:02:58Z) - SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z) - Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z) - Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z) - Learning Sequential Latent Variable Models from Multimodal Time Series Data [6.107812768939553]
We present a self-supervised generative modelling framework to jointly learn a probabilistic latent state representation of multimodal data.
We demonstrate that our approach leads to significant improvements in prediction and representation quality.
arXiv Detail & Related papers (2022-04-21T21:59:24Z) - Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
arXiv Detail & Related papers (2021-04-26T15:55:01Z) - Deep Multimodal Neural Architecture Search [178.35131768344246]
We devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks.
Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone.
On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks.
arXiv Detail & Related papers (2020-04-25T07:00:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.