What is Multimodality?
- URL: http://arxiv.org/abs/2103.06304v1
- Date: Wed, 10 Mar 2021 19:14:07 GMT
- Title: What is Multimodality?
- Authors: Letitia Parcalabescu, Nils Trost, Anette Frank
- Abstract summary: We explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era.
We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning.
- Score: 13.922507071009958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The last years have shown rapid developments in the field of multimodal
machine learning, combining e.g., vision, text or speech. In this position
paper we explain how the field uses outdated definitions of multimodality that
prove unfit for the machine learning era. We propose a new task-relative
definition of (multi)modality in the context of multimodal machine learning
that focuses on representations and information that are relevant for a given
machine learning task. With our new definition of multimodality we aim to
provide a missing foundation for multimodal research, an important component of
language grounding and a crucial milestone towards NLU.
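To make the task-relative reading concrete, here is a small, purely illustrative Python sketch of what defining modalities relative to a task could look like; the class names, fields, and the is_multimodal check are our own hypothetical constructions, not notation or code from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Modality:
    """A modality considered relative to a task (illustrative only)."""
    name: str            # e.g. "vision", "text"
    representation: str  # how task-relevant information is encoded, e.g. "pixels", "tokens"

@dataclass
class MultimodalTask:
    name: str
    modalities: tuple[Modality, ...]

    def is_multimodal(self) -> bool:
        # Under a task-relative reading, a task counts as multimodal only if
        # more than one task-relevant representation contributes information.
        return len({m.representation for m in self.modalities}) > 1

# The same raw artefact (say, a scanned document) can ground different tasks:
ocr = MultimodalTask("OCR", (Modality("vision", "pixels"),))
vqa = MultimodalTask("document VQA",
                     (Modality("vision", "pixels"), Modality("text", "tokens")))
print(ocr.is_multimodal(), vqa.is_multimodal())  # False True
```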
Related papers
- LLMs Meet Multimodal Generation and Editing: A Survey [89.76691959033323]
This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio.
We summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods.
We dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction.
arXiv Detail & Related papers (2024-05-29T17:59:20Z)
- What Makes Multimodal In-Context Learning Work? [58.48612721156335]
We present a framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models.
M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality.
We identify several biases and limitations of M-ICL that warrant consideration prior to deployment.
arXiv Detail & Related papers (2024-04-24T08:50:45Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- On the Computational Benefit of Multimodal Learning [3.4991031406102238]
We show that, under certain conditions, multimodal learning can outpace unimodal learning exponentially in terms of computation.
Specifically, we present a learning task that is NP-hard for unimodal learning but is solvable in polynomial time by a multimodal algorithm.
arXiv Detail & Related papers (2023-09-25T00:20:50Z)
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
Diverse multi-modal masked language modeling is realized via an object divergence constraint on top of traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique for learning to embed information from different modalities together with their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions [68.6358773622615]
This paper provides an overview of the computational and theoretical foundations of multimodal machine learning.
We propose a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification.
Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches.
arXiv Detail & Related papers (2022-09-07T19:21:19Z)
- Multi-Task Learning for Visual Scene Understanding [7.191593674138455]
This thesis is concerned with multi-task learning in the context of computer vision.
We propose several methods that tackle important aspects of multi-task learning.
The results show several advances in the state-of-the-art of multi-task learning.
arXiv Detail & Related papers (2022-03-28T16:57:58Z)
- Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose Channel-Exchanging-Network (CEN), which is self-adaptive, parameter-free, and, more importantly, applicable to both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between sub-networks of different modalities.
For dense image prediction, the validity of CEN is tested in four different scenarios.
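The core exchanging step can be sketched in a few lines of PyTorch; this is our own simplification for illustration only (the function name, the fixed threshold, and the use of BatchNorm-style scaling factors as importance scores are assumptions, not the authors' implementation):

```python
import torch

def exchange_channels(feat_a, feat_b, scale_a, scale_b, threshold=1e-2):
    """Swap low-importance channels between two modality streams (illustrative).

    feat_a, feat_b: (B, C, H, W) feature maps from two modality sub-networks.
    scale_a, scale_b: (C,) per-channel importance scores, e.g. BatchNorm gammas.
    Channels whose score falls below `threshold` are replaced by the other
    modality's channels at the same index.
    """
    give_up_a = (scale_a.abs() < threshold).view(1, -1, 1, 1)
    give_up_b = (scale_b.abs() < threshold).view(1, -1, 1, 1)
    out_a = torch.where(give_up_a, feat_b, feat_a)  # A receives B where A is weak
    out_b = torch.where(give_up_b, feat_a, feat_b)  # B receives A where B is weak
    return out_a, out_b

# toy usage with random features and stand-in scaling factors
a, b = torch.randn(2, 8, 16, 16), torch.randn(2, 8, 16, 16)
ga, gb = torch.rand(8), torch.rand(8)
a2, b2 = exchange_channels(a, b, ga, gb, threshold=0.3)
```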
arXiv Detail & Related papers (2021-12-04T05:47:54Z)
- Revisit Multimodal Meta-Learning through the Lens of Multi-Task Learning [33.19179706038397]
Multimodal meta-learning is a recent problem that extends conventional few-shot meta-learning by generalizing its setup to diverse multimodal task distributions.
Previous work claims that a single meta-learner trained on a multimodal distribution can sometimes outperform multiple specialized meta-learners trained on individual unimodal distributions.
Our work makes two contributions to multimodal meta-learning. First, we propose a method to quantify knowledge transfer between tasks of different modes at a micro-level.
Second, inspired by hard parameter sharing in multi-task learning and a new interpretation of related work, we propose a new multimodal meta-learner.
arXiv Detail & Related papers (2021-10-27T06:23:45Z)
- Unsupervised Multimodal Language Representations using Convolutional Autoencoders [5.464072883537924]
We propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks.
We map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets.
It is also shown that our method is extremely lightweight and can easily be generalized to other tasks and unseen data with a small performance drop and almost the same number of parameters.
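A minimal PyTorch sketch of the recipe just described: per-timestep features from the word-aligned modalities are concatenated into a 2-D matrix per utterance and compressed by a small convolutional autoencoder. All shapes, layer sizes, and the class name are our own illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Compress a (timesteps x feature_dim) multimodal matrix into one embedding."""
    def __init__(self, n_timesteps=20, feat_dim=80, emb_dim=64):
        super().__init__()
        flat = 16 * (n_timesteps // 4) * (feat_dim // 4)
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(flat, emb_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(emb_dim, flat), nn.ReLU(),
            nn.Unflatten(1, (16, n_timesteps // 4, feat_dim // 4)),
            nn.ConvTranspose2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(8, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):            # x: (batch, 1, timesteps, feat_dim)
        z = self.encoder(x)          # task-agnostic multimodal embedding
        return self.decoder(z), z

# word-aligned text/audio/vision features concatenated per timestep (toy shapes)
text, audio, vision = torch.randn(4, 20, 50), torch.randn(4, 20, 10), torch.randn(4, 20, 20)
x = torch.cat([text, audio, vision], dim=-1).unsqueeze(1)   # (4, 1, 20, 80)
recon, emb = ConvAutoencoder()(x)
loss = nn.functional.mse_loss(recon, x)                     # reconstruction objective
```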
arXiv Detail & Related papers (2021-10-06T18:28:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.