MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep
Learning
- URL: http://arxiv.org/abs/2306.16413v1
- Date: Wed, 28 Jun 2023 17:59:10 GMT
- Title: MultiZoo & MultiBench: A Standardized Toolkit for Multimodal Deep
Learning
- Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng,
Louis-Philippe Morency, Ruslan Salakhutdinov
- Abstract summary: MultiZoo is a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms.
MultiBench is a benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
- Score: 110.54752872873472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning multimodal representations involves integrating information from
multiple heterogeneous sources of data. In order to accelerate progress towards
understudied modalities and tasks while ensuring real-world robustness, we
release MultiZoo, a public toolkit consisting of standardized implementations
of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark
spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
Together, these provide an automated end-to-end machine learning pipeline that
simplifies and standardizes data loading, experimental setup, and model
evaluation. To enable holistic evaluation, we offer a comprehensive methodology
to assess (1) generalization, (2) time and space complexity, and (3) modality
robustness. MultiBench paves the way towards a better understanding of the
capabilities and limitations of multimodal models, while ensuring ease of use,
accessibility, and reproducibility. Our toolkits are publicly available, will
be regularly updated, and welcome inputs from the community.
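The abstract's third evaluation axis, modality robustness, can be illustrated with a small self-contained sketch. The dataset, model, and function names below are hypothetical stand-ins, not MultiBench's actual API: a trivial late-fusion classifier is evaluated while one modality is corrupted with increasing Gaussian noise, mimicking the style of robustness curve the benchmark reports.

```python
import random

random.seed(0)

# Hypothetical two-modality dataset: each sample has a "visual" and a "text"
# feature (scalars here for simplicity); the label is 1 when their sum is positive.
def make_dataset(n=500):
    data = []
    for _ in range(n):
        v = random.gauss(0, 1)
        t = random.gauss(0, 1)
        data.append((v, t, 1 if v + t > 0 else 0))
    return data

# A trivial late-fusion "model": sum the two modality scores and threshold.
def predict(v, t):
    return 1 if v + t > 0 else 0

def accuracy(data, noise_std=0.0):
    """Evaluate accuracy while corrupting the visual modality with noise,
    in the spirit of MultiBench-style modality-robustness evaluation."""
    correct = 0
    for v, t, y in data:
        v_noisy = v + random.gauss(0, noise_std)
        correct += predict(v_noisy, t) == y
    return correct / len(data)

data = make_dataset()
for std in (0.0, 0.5, 1.0, 2.0):
    print(f"noise std {std}: accuracy {accuracy(data, std):.2f}")
```

Plotting accuracy against noise level yields a robustness curve; a model that degrades gracefully under such perturbations is preferred over one with a sharp drop-off.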
Related papers
- VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning [22.27364585438247]
VSearcher is a multimodal search agent capable of long-horizon, multi-turn tool use in real-world web environments.
We introduce an Iterative Injection Data Synthesis pipeline to generate large-scale, complex multimodal QA questions.
We then adopt an SFT-then-RL training pipeline to turn base multimodal models into agents capable of multi-turn tool calling in real-world web environments.
arXiv Detail & Related papers (2026-03-03T09:33:22Z)
- BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents [30.849897676091327]
Multimodal large language models (MLLMs) are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
We introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains.
Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.
arXiv Detail & Related papers (2026-02-13T12:25:13Z)
- MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains [35.511656323075506]
We have developed a large-scale, domain-adaptive benchmark for multimodal evaluation.
This benchmark integrates over 30 datasets, encompassing 15 modalities and 20 predictive tasks.
We have also developed an open-source, unified, and automated evaluation pipeline.
arXiv Detail & Related papers (2025-11-09T16:37:09Z)
- Emerging Properties in Unified Multimodal Pretraining [32.856334401494145]
We introduce BAGEL, an open-source foundational model that supports multimodal understanding and generation.
BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data.
It significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks.
arXiv Detail & Related papers (2025-05-20T17:59:30Z)
- Multi-modal Time Series Analysis: A Tutorial and Survey [36.93906365779472]
Multi-modal time series analysis has emerged as a prominent research area in data mining.
However, effective analysis of multi-modal time series is hindered by data heterogeneity, modality gap, misalignment, and inherent noise.
Recent advancements in multi-modal time series methods have exploited the multi-modal context via cross-modal interactions.
arXiv Detail & Related papers (2025-03-17T20:30:02Z)
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models [89.63342806812413]
We present an open-source toolkit for evaluating large multi-modality models based on PyTorch.
VLMEvalKit implements over 70 different large multi-modality models, including both proprietary APIs and open-source models.
We host OpenVLM Leaderboard to track the progress of multi-modality learning research.
arXiv Detail & Related papers (2024-07-16T13:06:15Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities and can even solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
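The projection idea described above can be sketched minimally: each modality gets its own map into a shared space, after which any modality pair can interact through a common operation. The dimensions and names below are illustrative assumptions, not the paper's architecture.

```python
import random

random.seed(1)
DIM = 4  # hypothetical shared embedding size

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def project(x, W):
    """Linear map of a modality-specific feature vector into the shared space."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

# Each modality has its own feature dimensionality and its own projection.
W_audio = random_matrix(DIM, 6)   # audio features are 6-d
W_image = random_matrix(DIM, 10)  # image features are 10-d

audio_feat = [random.gauss(0, 1) for _ in range(6)]
image_feat = [random.gauss(0, 1) for _ in range(10)]

z_audio = project(audio_feat, W_audio)
z_image = project(image_feat, W_image)
assert len(z_audio) == len(z_image) == DIM  # both live in the common space

# Once in the common space, any modality pair can interact (here via a dot
# product), even a pair never observed together during training.
score = sum(a * b for a, b in zip(z_audio, z_image))
```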
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Generalized Product-of-Experts for Learning Multimodal Representations in Noisy Environments [18.14974353615421]
We propose a novel method for multimodal representation learning in a noisy environment via the generalized product of experts technique.
In the proposed method, we train a separate network for each modality to assess the credibility of information coming from that modality.
We attain state-of-the-art performance on two challenging benchmarks: multimodal 3D hand-pose estimation and multimodal surgical video segmentation.
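A product-of-Gaussian-experts fusion, the building block this method generalizes, can be sketched as follows. The credibility weight `alpha` stands in for the per-modality network described above; the function name and signature are illustrative, not the authors' API.

```python
# Product-of-Gaussian-experts fusion: each modality expert outputs a mean and
# a variance; the fused precision is the sum of the weighted expert precisions.
def poe_fuse(experts):
    """experts: list of (mean, variance, credibility weight alpha) triples.
    In a generalized PoE, alpha would come from a learned network that scores
    how trustworthy each modality currently is."""
    precision = sum(alpha / var for _, var, alpha in experts)
    mean = sum(alpha * mu / var for mu, var, alpha in experts) / precision
    return mean, 1.0 / precision

# A clean modality (low variance) dominates a noisy one (high variance),
# which is exactly the behavior wanted in a noisy environment.
fused_mean, fused_var = poe_fuse([(1.0, 0.1, 1.0), (5.0, 10.0, 1.0)])
print(fused_mean)  # close to 1.0, the low-variance expert's estimate
```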
arXiv Detail & Related papers (2022-11-07T14:27:38Z)
- SINGA-Easy: An Easy-to-Use Framework for MultiModal Analysis [18.084628500554462]
We introduce SINGA-Easy, a new deep learning framework that provides distributed hyperparameter tuning at the training stage, dynamic computational cost control at the inference stage, and intuitive user interactions with multimedia content facilitated by model explanation.
Our experiments on the training and deployment of multi-modality data analysis applications show that the framework is both usable and adaptable to dynamic inference loads.
arXiv Detail & Related papers (2021-08-03T08:39:54Z)
- MultiBench: Multiscale Benchmarks for Multimodal Representation Learning [87.23266008930045]
MultiBench is a systematic and unified benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
It provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation.
It introduces impactful challenges for future research, including robustness to large-scale multimodal datasets and robustness to realistic imperfections.
arXiv Detail & Related papers (2021-07-15T17:54:36Z)
- The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements [14.707930573950787]
We present MuSe-CaR, a first-of-its-kind multimodal dataset.
The data is publicly available as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.