4M: Massively Multimodal Masked Modeling
- URL: http://arxiv.org/abs/2312.06647v1
- Date: Mon, 11 Dec 2023 18:57:35 GMT
- Title: 4M: Massively Multimodal Masked Modeling
- Authors: David Mizrahi, Roman Bachmann, Oğuzhan Fatih Kar, Teresa Yeo,
Mingfei Gao, Afshin Dehghan, Amir Zamir
- Abstract summary: Current machine learning models for vision are often highly specialized and limited to a single modality and task.
Recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision.
We propose a multimodal training scheme called 4M for training versatile and scalable foundation models for vision tasks.
- Score: 20.69496647914175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current machine learning models for vision are often highly specialized and
limited to a single modality and task. In contrast, recent large language
models exhibit a wide range of capabilities, hinting at a possibility for
similarly versatile models in computer vision. In this paper, we take a step in
this direction and propose a multimodal training scheme called 4M. It consists
of training a single unified Transformer encoder-decoder using a masked
modeling objective across a wide range of input/output modalities - including
text, images, geometric, and semantic modalities, as well as neural network
feature maps. 4M achieves scalability by unifying the representation space of
all modalities: it maps them into discrete tokens and performs multimodal
masked modeling on a small randomized subset of those tokens.
4M leads to models that exhibit several key capabilities: (1) they can
perform a diverse set of vision tasks out of the box, (2) they excel when
fine-tuned for unseen downstream tasks or new input modalities, and (3) they
can function as a generative model that can be conditioned on arbitrary
modalities, enabling a wide variety of expressive multimodal editing
capabilities with remarkable flexibility.
Through experimental analyses, we demonstrate the potential of 4M for
training versatile and scalable foundation models for vision tasks, setting the
stage for further exploration in multimodal learning for vision and other
domains.
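The training recipe in the abstract (map every modality to discrete tokens, then sample a small random subset as inputs and predict another masked subset) can be sketched as follows. This is a minimal, hypothetical illustration only: the modality names, token ids, and subset sizes are placeholders, not 4M's actual tokenizers or implementation.

```python
import random

def multimodal_masked_batch(token_seqs, num_inputs=8, num_targets=8):
    """Sketch of 4M-style masking: all modalities are assumed to be
    already mapped to discrete tokens; sample a small random subset as
    encoder inputs and a disjoint subset as decoder targets."""
    # Pool tokens from all modalities, tagging each with its modality.
    pool = [(mod, tok) for mod, toks in token_seqs.items() for tok in toks]
    random.shuffle(pool)
    inputs = pool[:num_inputs]                            # visible to the encoder
    targets = pool[num_inputs:num_inputs + num_targets]   # decoder must predict these
    return inputs, targets

# Placeholder token sequences per modality (ids are illustrative only).
batch = {
    "text":  [101, 57, 902],       # e.g. text tokenizer ids
    "rgb":   [5, 17, 3, 88, 41],   # e.g. VQ image tokens
    "depth": [9, 12, 7],           # a tokenized geometric modality
}
inp, tgt = multimodal_masked_batch(batch, num_inputs=4, num_targets=4)
print(len(inp), len(tgt))  # 4 4
```

Masking only a small randomized subset per step is what keeps the scheme tractable as the number of modalities grows, since the sequence length seen by the model stays fixed.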
Related papers
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show that it is possible to train one model to solve at least 3x more tasks/modalities than existing models, without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
- Self-supervised Pre-training for Transferable Multi-modal Perception [15.93440465377068]
NeRF-Supervised Masked Auto (NS-MAE) is a self-supervised pre-training paradigm for transferable multi-modal representation learning.
Our approach uses masked multi-modal reconstruction in neural radiance fields (NeRF), training the model to reconstruct missing or corrupted input data.
Extensive experiments demonstrate the promising transferability of NS-MAE representations across diverse multi-modal and single-modal perception models.
arXiv Detail & Related papers (2024-05-28T08:13:49Z)
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation [61.392147185793476]
We present a unified and versatile foundation model, namely, SEED-X.
SEED-X is able to model multi-granularity visual semantics for comprehension and generation tasks.
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
arXiv Detail & Related papers (2024-04-22T17:56:09Z)
- VL-Mamba: Exploring State Space Models for Multimodal Learning [22.701028299912398]
In this work, we propose VL-Mamba, a multimodal large language model based on state space models.
Specifically, we first replace the transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model.
arXiv Detail & Related papers (2024-03-20T13:48:50Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
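The "visual words" mapping described above (visual features scored against the model's text-embedding table, then softmaxed into vocabulary distributions) can be sketched as follows. All shapes, the temperature parameter, and the random features here are hypothetical placeholders, not the paper's actual configuration.

```python
import numpy as np

def visual_words(visual_feats, vocab_embeddings, temperature=1.0):
    """Sketch: project each visual feature onto an LMM's text-embedding
    table and softmax the similarities, yielding one probability
    distribution over the vocabulary per visual token."""
    logits = visual_feats @ vocab_embeddings.T / temperature  # (n_patches, vocab_size)
    logits -= logits.max(axis=-1, keepdims=True)              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 32))     # 4 visual patches, 32-dim (placeholder)
embeds = rng.normal(size=(100, 32))  # 100-entry vocabulary table (placeholder)
dist = visual_words(feats, embeds)
print(dist.shape)  # (4, 100)
```

Each row of `dist` is a valid probability distribution, which is what lets visual inputs be treated with the same auto-regressive machinery as text tokens.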
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including the emergent ability to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework [51.01581167257862]
UnifiedVisionGPT is a novel framework designed to consolidate and automate the integration of SOTA vision models.
This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2023-11-16T13:01:25Z)
- MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.