Foundations and Recent Trends in Multimodal Machine Learning:
Principles, Challenges, and Open Questions
- URL: http://arxiv.org/abs/2209.03430v1
- Date: Wed, 7 Sep 2022 19:21:19 GMT
- Title: Foundations and Recent Trends in Multimodal Machine Learning:
Principles, Challenges, and Open Questions
- Authors: Paul Pu Liang, Amir Zadeh, Louis-Philippe Morency
- Abstract summary: This paper provides an overview of the computational and theoretical foundations of multimodal machine learning.
We propose a taxonomy of 6 core technical challenges: representation, alignment, reasoning, generation, transference, and quantification.
Recent technical achievements will be presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches.
- Score: 68.6358773622615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal machine learning is a vibrant multi-disciplinary research field
that aims to design computer agents with intelligent capabilities such as
understanding, reasoning, and learning through integrating multiple
communicative modalities, including linguistic, acoustic, visual, tactile, and
physiological messages. With the recent interest in video understanding,
embodied autonomous agents, text-to-image generation, and multisensor fusion in
application domains such as healthcare and robotics, multimodal machine
learning has brought unique computational and theoretical challenges to the
machine learning community given the heterogeneity of data sources and the
interconnections often found between modalities. However, the breadth of
progress in multimodal research has made it difficult to identify the common
themes and open questions in the field. By synthesizing a broad range of
application domains and theoretical frameworks from both historical and recent
perspectives, this paper is designed to provide an overview of the
computational and theoretical foundations of multimodal machine learning. We
start by defining two key principles of modality heterogeneity and
interconnections that have driven subsequent innovations, and propose a
taxonomy of 6 core technical challenges: representation, alignment, reasoning,
generation, transference, and quantification covering historical and recent
trends. Recent technical achievements will be presented through the lens of
this taxonomy, allowing researchers to understand the similarities and
differences across new approaches. We end by motivating several open problems
for future research as identified by our taxonomy.
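To make the taxonomy more concrete, below is a minimal, hypothetical sketch (not the paper's method) of how two of its challenges, representation and alignment, commonly surface in practice: modality-specific encoders map heterogeneous inputs into a shared space, a contrastive loss encourages cross-modal alignment, and a fusion head produces a joint representation. All module names, dimensions, and the temperature value are illustrative assumptions.
```python
# Hypothetical sketch of multimodal representation + alignment (PyTorch).
# Not the paper's method; dimensions and loss are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointFusionModel(nn.Module):
    def __init__(self, text_dim=300, image_dim=2048, joint_dim=256):
        super().__init__()
        # Modality-specific encoders map heterogeneous inputs into a shared space.
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, joint_dim), nn.ReLU())
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, joint_dim), nn.ReLU())
        # Fusion head combines the two representations (simple concatenation).
        self.fusion = nn.Linear(2 * joint_dim, joint_dim)

    def forward(self, text_feats, image_feats):
        t = F.normalize(self.text_encoder(text_feats), dim=-1)
        v = F.normalize(self.image_encoder(image_feats), dim=-1)
        # Fused multimodal representation (the "representation" challenge).
        joint = self.fusion(torch.cat([t, v], dim=-1))
        # InfoNCE-style contrastive loss pulls paired text/image embeddings
        # together (the "alignment" challenge); 0.07 is an assumed temperature.
        logits = t @ v.T / 0.07
        targets = torch.arange(t.size(0))
        align_loss = F.cross_entropy(logits, targets)
        return joint, align_loss

# Usage with random stand-in features for a batch of 8 paired examples.
model = JointFusionModel()
joint, loss = model(torch.randn(8, 300), torch.randn(8, 2048))
print(joint.shape, loss.item())
```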
Related papers
- Foundations of Multisensory Artificial Intelligence [32.56967614091527]
This thesis aims to advance the machine learning foundations of multisensory AI.
In the first part, we present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task.
In the second part, we study the design of practical multimodal foundation models that generalize over many modalities and tasks.
arXiv Detail & Related papers (2024-04-29T14:45:28Z) - Multimodal Large Language Models: A Survey [36.06016060015404]
Multimodal large language models integrate multiple data types, such as images, text, audio, and other heterogeneous modalities.
This paper begins by defining the concept of multimodality and examining the historical development of multimodal algorithms.
A practical guide is provided, offering insights into the technical aspects of multimodal models.
Lastly, we explore the applications of multimodal models and discuss the challenges associated with their development.
arXiv Detail & Related papers (2023-11-22T05:15:12Z) - Multimodal Foundation Models: From Specialists to General-Purpose
Assistants [187.72038587829223]
The research landscape encompasses five core topics, categorized into two classes.
The target audiences of the paper are researchers, graduate students, and professionals in computer vision and vision-language multimodal communities.
arXiv Detail & Related papers (2023-09-18T17:56:28Z) - Machine Unlearning: A Survey [56.79152190680552]
A special need has arisen where, for reasons of privacy, usability, and/or the right to be forgotten, information about specific samples must be removed from a trained model; this process is called machine unlearning.
This emerging technology has drawn significant interest from both academics and industry due to its innovation and practicality.
No prior study, however, has analyzed this complex topic or compared the feasibility of existing unlearning solutions across different kinds of scenarios.
The survey concludes by highlighting some of the outstanding issues with unlearning techniques, along with some feasible directions for new research opportunities.
arXiv Detail & Related papers (2023-06-06T10:18:36Z) - Foundation Models for Decision Making: Problems, Methods, and
Opportunities [124.79381732197649]
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks.
New paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning.
Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems.
arXiv Detail & Related papers (2023-03-07T18:44:07Z) - Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning, which incorporates data from various sources, has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, mainly covering vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z) - Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z) - Multimodality in Meta-Learning: A Comprehensive Survey [34.69292359136745]
This survey provides a comprehensive overview of the multimodality-based meta-learning landscape.
We first formalize the definition of meta-learning and multimodality, along with the research challenges in this growing field.
We then propose a new taxonomy to systematically discuss typical meta-learning algorithms combined with multimodal tasks.
arXiv Detail & Related papers (2021-09-28T09:16:12Z) - A Review on Explainability in Multimodal Deep Neural Nets [2.3204178451683264]
Multimodal AI techniques have achieved much success in several application domains.
Despite their outstanding performance, the complex, opaque, and black-box nature of deep neural nets limits their social acceptance and usability.
This paper extensively reviews the present literature to provide a comprehensive survey and commentary on explainability in multimodal deep neural nets.
arXiv Detail & Related papers (2021-05-17T14:17:49Z)