Multimodal Graph Learning for Generative Tasks
- URL: http://arxiv.org/abs/2310.07478v2
- Date: Thu, 12 Oct 2023 17:07:24 GMT
- Title: Multimodal Graph Learning for Generative Tasks
- Authors: Minji Yoon, Jing Yu Koh, Bryan Hooi, Ruslan Salakhutdinov
- Abstract summary: Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
- Score: 89.44810441463652
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal learning combines multiple data modalities, broadening the types
and complexity of data our models can utilize: for example, from plain text to
image-caption pairs. Most multimodal learning algorithms focus on modeling
simple one-to-one pairs of data from two modalities, such as image-caption
pairs, or audio-text pairs. However, in most real-world settings, entities of
different modalities interact with each other in more complex and multifaceted
ways, going beyond one-to-one mappings. We propose to represent these complex
relationships as graphs, allowing us to capture data with any number of
modalities, and with complex relationships between modalities that can flexibly
vary from one sample to another. Toward this goal, we propose Multimodal Graph
Learning (MMGL), a general and systematic framework for capturing information
from multiple multimodal neighbors with relational structures among them. In
particular, we focus on MMGL for generative tasks, building upon pretrained
Language Models (LMs), aiming to augment their text generation with multimodal
neighbor contexts. We study three research questions raised by MMGL: (1) how
can we infuse information from multiple neighbors into the pretrained LMs, while
avoiding scalability issues? (2) how can we infuse the graph structure
information among multimodal neighbors into the LMs? and (3) how can we
finetune the pretrained LMs to learn from the neighbor context in a
parameter-efficient manner? We conduct extensive experiments to answer these
three questions on MMGL and analyze the empirical results to pave the way for
future MMGL research.
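To make the general recipe concrete, the following is a minimal, hypothetical PyTorch sketch of the idea described above: features of multimodal neighbors are projected into the LM's token-embedding space and prepended as soft context tokens, a simple additive position embedding stands in for graph-structure information, and only the small projection layers are trained while a toy stand-in for the "pretrained LM" stays frozen. All module names, dimensions, and the toy transformer are assumptions for illustration, not the authors' implementation.
```python
# Hypothetical sketch (not the authors' code): embeddings of multimodal
# neighbors are mapped into a language model's embedding space, prepended
# as soft context tokens, and only the small mapping layers are trained.
import torch
import torch.nn as nn

class NeighborInfusedLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=256, d_image=512, d_text=384):
        super().__init__()
        # Stand-in for a frozen pretrained LM: token embedding + transformer + head.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)
        frozen = list(self.tok_emb.parameters()) + list(self.lm.parameters()) + list(self.lm_head.parameters())
        for p in frozen:
            p.requires_grad = False  # question (3): keep the "pretrained LM" frozen

        # Question (1): trainable mappers that turn neighbor features into soft tokens.
        self.map_image = nn.Linear(d_image, d_model)
        self.map_text = nn.Linear(d_text, d_model)
        # Question (2): a toy graph/position encoding, one vector per neighbor slot.
        self.graph_pos = nn.Embedding(16, d_model)

    def forward(self, input_ids, image_feats, text_feats):
        # input_ids: (B, T); image_feats: (B, Ni, d_image); text_feats: (B, Nt, d_text)
        img = self.map_image(image_feats)
        txt = self.map_text(text_feats)
        neighbors = torch.cat([img, txt], dim=1)
        pos = self.graph_pos(torch.arange(neighbors.size(1), device=input_ids.device))
        neighbors = neighbors + pos                       # inject (toy) structure information
        tokens = self.tok_emb(input_ids)
        h = self.lm(torch.cat([neighbors, tokens], dim=1))
        return self.lm_head(h[:, neighbors.size(1):])     # predict only over text positions

model = NeighborInfusedLM()
logits = model(torch.randint(0, 32000, (2, 12)),
               torch.randn(2, 3, 512), torch.randn(2, 2, 384))
print(logits.shape)  # torch.Size([2, 12, 32000])
```
In this sketch, scalability is handled by compressing each neighbor into a single soft token, structure by the additive position embedding, and parameter efficiency by freezing the LM; the abstract's three research questions correspond to the concrete design choices made at each of these points.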
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model [12.890344377484759]
This review paper explores Multimodal Large Language Models (MLLMs).
MLLMs integrate Large Language Models (LLMs) like GPT-4 to handle multimodal data such as text and vision.
Choosing the appropriate modality alignment method is crucial, as improper methods might require more parameters with limited performance improvement.
arXiv Detail & Related papers (2023-11-10T09:51:24Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims to resolve ambiguous mentions to entities in a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Multimodal Understanding Through Correlation Maximization and Minimization [23.8764755753415]
We study the intrinsic nature of multimodal data by asking the following questions.
Can we learn more structured latent representations of general multimodal data?
Can we intuitively understand, both mathematically and visually, what the latent representations capture?
arXiv Detail & Related papers (2023-05-04T19:53:05Z)
- Self-Supervised Multimodal Learning: A Survey [23.526389924804207]
Multimodal learning aims to understand and analyze information from multiple modalities.
The heavy dependence on data paired with expensive human annotations impedes scaling up models.
Given the availability of large-scale unannotated data in the wild, self-supervised learning has become an attractive strategy to alleviate the annotation bottleneck.
arXiv Detail & Related papers (2023-03-31T16:11:56Z)
- High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
Our method is also shown to be extremely lightweight and to generalize easily to other tasks and unseen data, with only a small performance drop and almost the same number of parameters (a minimal illustrative sketch follows this list).
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
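As referenced in the last entry above, a minimal illustrative sketch of that convolutional-autoencoder setup is given below: word-aligned multimodal features are stacked into a 2-D matrix and compressed by a small autoencoder whose bottleneck serves as the task-agnostic representation. All shapes, layer sizes, and the reconstruction objective are assumptions for illustration, not the paper's exact architecture.
```python
# Hypothetical sketch of the approach summarized above: word-aligned
# multimodal features (e.g., text + audio + vision per timestep) are
# stacked into a 2-D matrix and compressed with a small convolutional
# autoencoder. All shapes and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalConvAE(nn.Module):
    def __init__(self):
        super().__init__()
        # Treat each sequence as a 1-channel "image": (batch, 1, timesteps, feature_dim)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)          # compact multimodal embedding map
        return self.decoder(z), z

# Toy batch: 32 aligned timesteps, 64-dim concatenated multimodal features.
model = MultimodalConvAE()
x = torch.randn(4, 1, 32, 64)
recon, z = model(x)
loss = F.mse_loss(recon, x)          # unsupervised reconstruction objective
print(z.flatten(1).shape)            # flattened embedding for downstream tasks
```
A downstream task head could then be trained on the flattened bottleneck z, which matches the lightweight, transferable setup the summary describes.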
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.