MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
- URL: http://arxiv.org/abs/2107.07502v1
- Date: Thu, 15 Jul 2021 17:54:36 GMT
- Title: MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
- Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu,
Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov,
Louis-Philippe Morency
- Abstract summary: MultiBench is a systematic and unified benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
It provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation.
It introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections.
- Score: 87.23266008930045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning multimodal representations involves integrating information from
multiple heterogeneous sources of data. It is a challenging yet crucial area
with numerous real-world applications in multimedia, affective computing,
robotics, finance, human-computer interaction, and healthcare. Unfortunately,
multimodal research has seen limited resources to study (1) generalization
across domains and modalities, (2) complexity during training and inference,
and (3) robustness to noisy and missing modalities. In order to accelerate
progress towards understudied modalities and tasks while ensuring real-world
robustness, we release MultiBench, a systematic and unified large-scale
benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6
research areas. MultiBench provides an automated end-to-end machine learning
pipeline that simplifies and standardizes data loading, experimental setup, and
model evaluation. To enable holistic evaluation, MultiBench offers a
comprehensive methodology to assess (1) generalization, (2) time and space
complexity, and (3) modality robustness. MultiBench introduces impactful
challenges for future research, including scalability to large-scale multimodal
datasets and robustness to realistic imperfections. To accompany this
benchmark, we also provide a standardized implementation of 20 core approaches
in multimodal learning. Simply applying methods proposed in different research
areas can improve the state-of-the-art performance on 9/15 datasets. Therefore,
MultiBench presents a milestone in unifying disjoint efforts in multimodal
research and paves the way towards a better understanding of the capabilities
and limitations of multimodal models, all the while ensuring ease of use,
accessibility, and reproducibility. MultiBench, our standardized code, and
leaderboards are publicly available, will be regularly updated, and welcome
inputs from the community.
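To make the pipeline described above concrete, the following is a minimal, self-contained late-fusion sketch of the kind of setup the abstract describes: one small encoder per modality, a fusion step, a shared training loop, and a simple modality-robustness probe that zeroes out one modality at test time. All names, shapes, and numbers below are illustrative assumptions and are not the MultiBench API.

```python
# Illustrative late-fusion sketch (not MultiBench code): per-modality encoders,
# concatenation fusion, a shared training loop, and a missing-modality probe.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, modality_dims, hidden=64, num_classes=2):
        super().__init__()
        # one small encoder per modality
        self.encoders = nn.ModuleList([nn.Linear(d, hidden) for d in modality_dims])
        self.head = nn.Linear(hidden * len(modality_dims), num_classes)

    def forward(self, modalities):
        # a missing modality can be simulated by passing a zero tensor
        feats = [torch.relu(enc(x)) for enc, x in zip(self.encoders, modalities)]
        return self.head(torch.cat(feats, dim=-1))

# toy data: two modalities with 16- and 8-dimensional features
x1, x2 = torch.randn(32, 16), torch.randn(32, 8)
y = torch.randint(0, 2, (32,))

model = LateFusionClassifier([16, 8])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):  # a few training steps
    loss = nn.functional.cross_entropy(model([x1, x2]), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# "modality robustness" probe: evaluate with the second modality dropped (zeroed)
with torch.no_grad():
    acc_full = (model([x1, x2]).argmax(-1) == y).float().mean()
    acc_drop = (model([x1, torch.zeros_like(x2)]).argmax(-1) == y).float().mean()
print(f"full: {acc_full.item():.2f}  second modality dropped: {acc_drop.item():.2f}")
```

In the benchmark itself, data loading, experimental setup, and evaluation (including tests under realistic imperfections) are standardized across the 15 datasets, so the same model code can be reused across tasks and research areas.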
Related papers
- HEMM: Holistic Evaluation of Multimodal Foundation Models [91.60364024897653]
Multimodal foundation models can holistically process text alongside images, video, audio, and other sensory modalities.
It is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains.
arXiv Detail & Related papers (2024-07-03T18:00:48Z)
- Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning [23.035725779568587]
We study the role and interactions of multiple modalities in mitigating forgetting in deep neural networks (DNNs).
Our findings demonstrate that leveraging multiple views and complementary information from multiple modalities enables the model to learn more accurate and robust representations.
We propose a method for integrating and aligning the information from different modalities by utilizing the relational structural similarities between the data points in each modality.
arXiv Detail & Related papers (2024-05-04T22:02:58Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep Learning [110.54752872873472]
MultiZoo is a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms.
MultiBench is a benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
arXiv Detail & Related papers (2023-06-28T17:59:10Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- On Robustness in Multimodal Learning [75.03719000820388]
Multimodal learning is defined as learning over multiple input modalities such as video, audio, and text.
We present a multimodal robustness framework to provide a systematic analysis of common multimodal representation learning methods.
arXiv Detail & Related papers (2023-04-10T05:02:07Z)
- Multimodality Representation Learning: A Survey on Evolution, Pretraining and Its Applications [47.501121601856795]
Multimodality Representation Learning is a technique of learning to embed information from different modalities and their correlations.
Cross-modal interaction and complementary information from different modalities are crucial for advanced models to perform any multimodal task.
This survey presents the literature on the evolution and enhancement of deep learning multimodal architectures.
arXiv Detail & Related papers (2023-02-01T11:48:34Z)
- Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique.
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality.
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation.
arXiv Detail & Related papers (2022-11-07T14:27:38Z)
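The last entry above builds on product-of-experts fusion, in which each modality contributes an estimate whose influence is scaled by its credibility. The sketch below shows the textbook operation for diagonal Gaussian experts; here the credibility weights are fixed constants (in the paper they come from a learned per-modality network), so this is a generic illustration rather than the authors' implementation.

```python
# Generic weighted product-of-experts fusion for diagonal Gaussian experts
# (illustrative; not the authors' code). Precisions add, weighted by credibility.
import numpy as np

def poe_fuse(mus, variances, weights):
    """Fuse per-modality Gaussian estimates into a single mean and variance."""
    mus, variances, weights = map(np.asarray, (mus, variances, weights))
    precisions = weights[:, None] / variances          # w_i / var_i, per dimension
    fused_var = 1.0 / precisions.sum(axis=0)           # combined precision -> variance
    fused_mu = fused_var * (precisions * mus).sum(axis=0)
    return fused_mu, fused_var

# two modalities estimating a 3-D quantity; the second is noisier and down-weighted
mu, var = poe_fuse(mus=[[1.0, 2.0, 0.5], [1.4, 1.6, 0.9]],
                   variances=[[0.1, 0.1, 0.1], [1.0, 1.0, 1.0]],
                   weights=[1.0, 0.3])
print(mu, var)  # the fused estimate stays close to the more reliable modality
```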