MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
- URL: http://arxiv.org/abs/2303.16839v3
- Date: Wed, 9 Aug 2023 05:39:34 GMT
- Title: MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
- Authors: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei
Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia
Angelova
- Abstract summary: We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
- Score: 59.09343552273045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of language models has moved from encoder-decoder to
decoder-only designs. In addition, we observe that the two most popular
multimodal tasks, the generative and contrastive tasks, are nontrivial to
accommodate in one architecture, and further need adaptations for downstream
tasks. We propose a novel paradigm of training with a decoder-only model for
multimodal tasks, which is surprisingly effective at jointly learning these
disparate vision-language tasks. This is done with a simple model, called
MaMMUT. It consists of a single vision encoder and a text decoder, and is able
to accommodate contrastive and generative learning by a novel two-pass approach
on the text decoder. We demonstrate that joint learning of these diverse
objectives is simple, effective, and maximizes the weight-sharing of the model
across these tasks. Furthermore, the same architecture enables straightforward
extensions to open-vocabulary object detection and video-language tasks. The
model tackles a diverse range of tasks, while being modest in capacity. Our
model achieves the state of the art on image-text and text-image retrieval,
video question answering and open-vocabulary detection tasks, outperforming
much larger and more extensively trained foundational models. It shows very
competitive results on VQA and Video Captioning, especially considering its
capacity. Ablations confirm the flexibility and advantages of our approach.
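The abstract's key architectural idea is running a single shared text decoder twice: one bidirectional pass to produce a text embedding for the contrastive loss, and one causal pass for generative next-token prediction. The following is a minimal illustrative sketch of that two-pass idea, not the authors' implementation; all dimensions and names are hypothetical, and the cross-attention to vision features used on the real generative pass is omitted for brevity (a random vector stands in for the vision-encoder output).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPassTextDecoder(nn.Module):
    """Shared transformer stack run twice: bidirectionally for the
    contrastive text embedding, causally for caption generation.
    (Hypothetical sketch; the real MaMMUT decoder also cross-attends
    to image features on the generative pass.)"""
    def __init__(self, vocab=100, dim=32, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, ids, causal):
        T = ids.size(1)
        # Causal pass masks future tokens; contrastive pass attends freely.
        mask = torch.triu(torch.full((T, T), float("-inf")), 1) if causal else None
        return self.blocks(self.embed(ids), mask=mask)

dec = TwoPassTextDecoder()
ids = torch.randint(0, 100, (2, 5))
img = torch.randn(2, 32)  # stand-in for vision-encoder embeddings

# Pass 1 (contrastive): bidirectional attention, pooled text embedding.
txt = dec(ids, causal=False).mean(dim=1)
sims = F.normalize(txt, dim=-1) @ F.normalize(img, dim=-1).T
contrastive_loss = F.cross_entropy(sims / 0.07, torch.arange(2))

# Pass 2 (generative): causal attention, next-token prediction.
h = dec(ids, causal=True)
logits = dec.lm_head(h[:, :-1])
gen_loss = F.cross_entropy(logits.reshape(-1, 100), ids[:, 1:].reshape(-1))

# Joint objective over the weight-shared decoder.
loss = contrastive_loss + gen_loss
```

Because both passes reuse the same weights, the two objectives share nearly the whole model, which is the weight-sharing property the abstract highlights.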
Related papers
- Everything is a Video: Unifying Modalities through Next-Frame Prediction [5.720266474212221]
We introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning.
We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components.
Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-video, video-to-text, and audio-to-text.
arXiv Detail & Related papers (2024-11-15T12:59:37Z)
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, and even emerges with the ability to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Valley: Video Assistant with Large Language model Enhanced abilitY [41.79449203718827]
We introduce Valley, a Video Assistant with Large Language model Enhanced abilitY.
To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset.
We employ ChatGPT to facilitate the construction of task-oriented conversation data.
arXiv Detail & Related papers (2023-06-12T16:11:10Z)
- A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision [93.90545426665999]
We take a close look at autoregressive decoders for multi-task learning in multimodal computer vision.
A key finding is that a small decoder learned on top of a frozen pretrained encoder works surprisingly well.
It can be seen as teaching a decoder to interact with a pretrained vision model via natural language.
arXiv Detail & Related papers (2023-03-30T13:42:58Z)
- MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling [7.737755720567113]
This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model.
We design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules.
We also propose a new pretraining task named Multiple Choice Modeling.
arXiv Detail & Related papers (2023-03-10T05:22:39Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.