MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep
Learning
- URL: http://arxiv.org/abs/2306.16413v1
- Date: Wed, 28 Jun 2023 17:59:10 GMT
- Title: MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep
Learning
- Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng,
Louis-Philippe Morency, Ruslan Salakhutdinov
- Abstract summary: MultiZoo is a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms.
MultiBench is a benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
- Score: 110.54752872873472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning multimodal representations involves integrating information from
multiple heterogeneous sources of data. In order to accelerate progress towards
understudied modalities and tasks while ensuring real-world robustness, we
release MultiZoo, a public toolkit consisting of standardized implementations
of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark
spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
Together, these provide an automated end-to-end machine learning pipeline that
simplifies and standardizes data loading, experimental setup, and model
evaluation. To enable holistic evaluation, we offer a comprehensive methodology
to assess (1) generalization, (2) time and space complexity, and (3) modality
robustness. MultiBench paves the way towards a better understanding of the
capabilities and limitations of multimodal models, while ensuring ease of use,
accessibility, and reproducibility. Our toolkits are publicly available, will
be regularly updated, and welcome inputs from the community.
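To make the pipeline described above concrete, here is a minimal, self-contained PyTorch sketch of the kind of standardized load/train/evaluate workflow the abstract outlines. All names and data are illustrative placeholders, not the actual MultiZoo/MultiBench API: it trains a simple late-fusion model on two synthetic modalities and reports the three evaluation axes listed above, (1) generalization via held-out accuracy, (2) time and space complexity via parameter count and inference time, and (3) modality robustness via accuracy under noise added to one modality.

```python
# Illustrative sketch only (hypothetical names, not the MultiZoo/MultiBench API):
# a standardized load -> train -> evaluate loop over two synthetic modalities,
# reporting (1) generalization, (2) time/space complexity, (3) modality robustness.
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-ins for two modalities (e.g., audio and image features).
d_a, d_b, n_classes = 20, 50, 5
xa, xb = torch.randn(600, d_a), torch.randn(600, d_b)
y = torch.randint(0, n_classes, (600,))
train = DataLoader(TensorDataset(xa[:500], xb[:500], y[:500]), batch_size=64, shuffle=True)
xa_te, xb_te, y_te = xa[500:], xb[500:], y[500:]

class LateFusion(nn.Module):
    """Per-modality encoders, concatenation fusion, and a linear classification head."""
    def __init__(self):
        super().__init__()
        self.enc_a = nn.Sequential(nn.Linear(d_a, 32), nn.ReLU())
        self.enc_b = nn.Sequential(nn.Linear(d_b, 32), nn.ReLU())
        self.head = nn.Linear(64, n_classes)

    def forward(self, a, b):
        return self.head(torch.cat([self.enc_a(a), self.enc_b(b)], dim=-1))

model = LateFusion()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                       # standardized experimental setup
    for a, b, t in train:
        opt.zero_grad()
        nn.functional.cross_entropy(model(a, b), t).backward()
        opt.step()

def test_accuracy(noise_std=0.0):
    """(1) generalization on held-out data; (3) robustness under noise on modality A."""
    with torch.no_grad():
        logits = model(xa_te + noise_std * torch.randn_like(xa_te), xb_te)
        return (logits.argmax(-1) == y_te).float().mean().item()

# (2) time and space complexity: parameter count and wall-clock inference time.
n_params = sum(p.numel() for p in model.parameters())
start = time.time()
clean = test_accuracy()
print(f"clean acc={clean:.3f}  noisy acc={test_accuracy(1.0):.3f}  "
      f"params={n_params}  inference_s={time.time() - start:.4f}")
```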
Related papers
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models [89.63342806812413]
We present an open-source toolkit for evaluating large multi-modality models based on PyTorch.
VLMEvalKit implements over 70 different large multi-modality models, including both proprietary APIs and open-source models.
We host OpenVLM Leaderboard to track the progress of multi-modality learning research.
arXiv Detail & Related papers (2024-07-16T13:06:15Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even solving tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
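A minimal sketch of the idea summarized above, under hypothetical names rather than the paper's code: every modality is projected into one shared space, so any subset of modalities, including combinations unseen during training, can be fused in the same way.

```python
# Minimal sketch (hypothetical names, not the paper's code): project features of
# different modalities into one shared space so that any subset of modalities --
# including combinations unseen during training -- can be fused identically.
import torch
import torch.nn as nn

class SharedSpaceProjector(nn.Module):
    def __init__(self, modality_dims: dict, shared_dim: int = 64):
        super().__init__()
        # One projection head per modality, all mapping into the same shared_dim.
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in modality_dims.items()}
        )

    def forward(self, features: dict) -> torch.Tensor:
        # Fuse whichever modalities are present by averaging their projections,
        # so the fused representation has the same shape for any combination.
        projected = [self.proj[name](x) for name, x in features.items()]
        return torch.stack(projected, dim=0).mean(dim=0)

# Usage: built for {video, audio, text}, queried with a previously unseen pair.
model = SharedSpaceProjector({"video": 512, "audio": 128, "text": 300})
fused = model({"audio": torch.randn(4, 128), "text": torch.randn(4, 300)})
print(fused.shape)  # torch.Size([4, 64])
```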
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, we derive lower and upper bounds that quantify the amount of multimodal interaction in this semi-supervised setting.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique.
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality.
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation.
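A rough sketch of product-of-experts style fusion with per-modality credibility, in the spirit of the summary above; the module names, the Gaussian parameterization, and the credibility head are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative product-of-experts fusion: each modality network predicts a Gaussian
# (mean, variance) plus a learned credibility weight; the fused estimate
# down-weights unreliable (noisy) modalities by scaling each expert's precision.
import torch
import torch.nn as nn

class ModalityExpert(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.mean = nn.Linear(in_dim, out_dim)
        self.log_var = nn.Linear(in_dim, out_dim)                       # per-dimension uncertainty
        self.credibility = nn.Sequential(nn.Linear(in_dim, 1), nn.Sigmoid())

    def forward(self, x):
        return self.mean(x), self.log_var(x).exp(), self.credibility(x)

def poe_fuse(experts_out):
    """Fuse Gaussian experts: precision-weighted mean, with credibility scaling."""
    precisions, weighted_means = [], []
    for mean, var, cred in experts_out:
        precision = cred / (var + 1e-6)          # low credibility -> low precision
        precisions.append(precision)
        weighted_means.append(precision * mean)
    total_precision = torch.stack(precisions).sum(0)
    fused_mean = torch.stack(weighted_means).sum(0) / total_precision
    return fused_mean, 1.0 / total_precision     # fused mean and variance

# Usage with two hypothetical modalities (e.g., RGB and depth features).
rgb_expert, depth_expert = ModalityExpert(256, 3), ModalityExpert(64, 3)
rgb, depth = torch.randn(8, 256), torch.randn(8, 64)
mean, var = poe_fuse([rgb_expert(rgb), depth_expert(depth)])
print(mean.shape, var.shape)  # torch.Size([8, 3]) torch.Size([8, 3])
```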
arXiv Detail & Related papers (2022-11-07T14:27:38Z)
- SINGA-Easy: An Easy-to-Use Framework for MultiModal Analysis [18.084628500554462]
We introduce SINGA-Easy, a new deep learning framework that provides distributed hyperparameter tuning at the training stage, dynamic computational cost control at the inference stage, and intuitive user interaction with multimedia content facilitated by model explanation.
Our experiments on the training and deployment of multi-modality data analysis applications show that the framework is both usable and adaptable to dynamic inference loads.
arXiv Detail & Related papers (2021-08-03T08:39:54Z)
- MultiBench: Multiscale Benchmarks for Multimodal Representation Learning [87.23266008930045]
MultiBench is a systematic and unified benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
It provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation.
It introduces impactful challenges for future research, including robustness to large-scale multimodal datasets and robustness to realistic imperfections.
arXiv Detail & Related papers (2021-07-15T17:54:36Z)
- The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements [14.707930573950787]
We present MuSe-CaR, a first-of-its-kind multimodal dataset.
The data is publicly available, as it recently served as the test bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information (including all listed content) and is not responsible for any consequences of its use.