4M: Massively Multimodal Masked Modeling
- URL: http://arxiv.org/abs/2312.06647v1
- Date: Mon, 11 Dec 2023 18:57:35 GMT
- Title: 4M: Massively Multimodal Masked Modeling
- Authors: David Mizrahi, Roman Bachmann, Oğuzhan Fatih Kar, Teresa Yeo,
Mingfei Gao, Afshin Dehghan, Amir Zamir
- Abstract summary: Current machine learning models for vision are often highly specialized and limited to a single modality and task.
Recent large language models exhibit a wide range of capabilities, hinting at the possibility of similarly versatile models in computer vision.
We propose a multimodal training scheme called 4M for training versatile and scalable foundation models for vision tasks.
- Score: 20.69496647914175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current machine learning models for vision are often highly specialized and
limited to a single modality and task. In contrast, recent large language
models exhibit a wide range of capabilities, hinting at the possibility of
similarly versatile models in computer vision. In this paper, we take a step in
this direction and propose a multimodal training scheme called 4M. It consists
of training a single unified Transformer encoder-decoder using a masked
modeling objective across a wide range of input/output modalities - including
text, images, geometric, and semantic modalities, as well as neural network
feature maps. 4M achieves scalability by unifying the representation space of
all modalities, mapping them into discrete tokens and performing
multimodal masked modeling on a small randomized subset of those tokens.
4M leads to models that exhibit several key capabilities: (1) they can
perform a diverse set of vision tasks out of the box, (2) they excel when
fine-tuned for unseen downstream tasks or new input modalities, and (3) they
can function as a generative model that can be conditioned on arbitrary
modalities, enabling a wide variety of expressive multimodal editing
capabilities with remarkable flexibility.
Through experimental analyses, we demonstrate the potential of 4M for
training versatile and scalable foundation models for vision tasks, setting the
stage for further exploration in multimodal learning for vision and other
domains.
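To make the training scheme concrete, below is a minimal PyTorch sketch of the core idea. It assumes each modality has already been converted to discrete token IDs by separate pre-trained tokenizers (not shown), and it simplifies 4M's masking strategy to a uniform random split into a small visible input subset and a masked target subset. All class names, helper functions, and hyperparameters are illustrative assumptions, not the authors' released implementation (see 4m.epfl.ch for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedMultimodalModel(nn.Module):
    """Toy Transformer encoder-decoder trained with a masked modeling
    objective over a flat sequence of discrete multimodal tokens."""

    def __init__(self, vocab_size=8192, max_len=1024, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        # Learned query used in place of every masked-out (target) token.
        self.mask_query = nn.Parameter(torch.zeros(1, 1, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids, input_pos, target_pos):
        # Encoder sees only the small visible subset (token content + position).
        memory = self.transformer.encoder(self.token_embed(input_ids) + self.pos_embed(input_pos))
        # Decoder queries carry only the positions of the masked targets;
        # their token content is what the model must predict.
        queries = self.mask_query.expand(target_pos.size(0), target_pos.size(1), -1) + self.pos_embed(target_pos)
        decoded = self.transformer.decoder(queries, memory)
        return self.head(decoded)  # (batch, n_targets, vocab_size)


def sample_token_budget(tokens, n_in=64, n_out=64):
    """Randomly split each token sequence into a visible input subset and a
    masked target subset (simplified stand-in for 4M's masking strategy)."""
    B, N = tokens.shape
    perm = torch.argsort(torch.rand(B, N), dim=1)
    in_pos, out_pos = perm[:, :n_in], perm[:, n_in:n_in + n_out]
    return torch.gather(tokens, 1, in_pos), in_pos, out_pos, torch.gather(tokens, 1, out_pos)


# One toy training step: random IDs stand in for tokenized text, RGB, depth,
# segmentation, etc., concatenated along the sequence dimension.
model = MaskedMultimodalModel()
tokens = torch.randint(0, 8192, (2, 256))
in_ids, in_pos, out_pos, out_ids = sample_token_budget(tokens)
logits = model(in_ids, in_pos, out_pos)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), out_ids.reshape(-1))
loss.backward()
```

In this simplified setup, any modality can serve as conditioning simply by placing its tokens in the visible input subset, and generation amounts to repeatedly predicting and committing masked tokens for the desired output modality.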
Related papers
- VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks [60.5257456681402]
We build universal embedding models capable of handling a wide range of downstream tasks.
Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB.
arXiv Detail & Related papers (2024-10-07T16:14:05Z)
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945]
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
arXiv Detail & Related papers (2024-09-26T09:57:16Z)
- 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities [17.374241865041856]
We show the possibility of training one model to solve at least 3x more tasks/modalities than existing ones and doing so without a loss in performance.
We successfully scale the training to a three billion parameter model using tens of modalities and different datasets.
The resulting models and training code are open sourced at 4m.epfl.ch.
arXiv Detail & Related papers (2024-06-13T17:59:42Z)
- VL-Mamba: Exploring State Space Models for Multimodal Learning [22.701028299912398]
In this work, we propose VL-Mamba, a multimodal large language model based on state space models.
Specifically, we first replace the transformer-based backbone language model, such as LLaMA or Vicuna, with the pre-trained Mamba language model.
arXiv Detail & Related papers (2024-03-20T13:48:50Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, covering task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework [51.01581167257862]
UnifiedVisionGPT is a novel framework designed to consolidate and automate the integration of SOTA vision models.
This paper outlines the architecture and capabilities of UnifiedVisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2023-11-16T13:01:25Z)
- MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.