Otter: A Multi-Modal Model with In-Context Instruction Tuning
- URL: http://arxiv.org/abs/2305.03726v1
- Date: Fri, 5 May 2023 17:59:46 GMT
- Title: Otter: A Multi-Modal Model with In-Context Instruction Tuning
- Authors: Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, Ziwei Liu
- Abstract summary: We introduce instruction tuning into multi-modal models, motivated by the Flamingo model's upstream interleaved format pretraining dataset.
We then introduce Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following ability and in-context learning.
- Score: 30.804061018682244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated significant universal
capabilities as few/zero-shot learners in various tasks due to their
pre-training on vast amounts of text data, as exemplified by GPT-3, which
evolved into InstructGPT and ChatGPT, models that effectively follow natural language
instructions to accomplish real-world tasks. In this paper, we propose to
introduce instruction tuning into multi-modal models, motivated by the Flamingo
model's upstream interleaved format pretraining dataset. We adopt a similar
approach to construct our MultI-Modal In-Context Instruction Tuning (MIMIC-IT)
dataset. We then introduce Otter, a multi-modal model based on OpenFlamingo
(open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and
showcasing improved instruction-following ability and in-context learning. We
also optimize OpenFlamingo's implementation for researchers, democratizing the
required training resources from 1× A100 GPU to 4× RTX-3090 GPUs,
and integrate both OpenFlamingo and Otter into Huggingface Transformers for
more researchers to incorporate the models into their customized training and
inference pipelines.
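
As a concrete illustration of the Huggingface Transformers integration and the interleaved in-context format described above, the sketch below shows how such a checkpoint might be loaded and prompted. It is a minimal sketch, not the official recipe: the checkpoint id is a placeholder, and the exact model class, special tokens, and image preprocessing are defined by the Otter/OpenFlamingo releases rather than by this summary.

```python
# Minimal sketch (assumptions flagged in comments): loading an Otter/OpenFlamingo-style
# checkpoint through Hugging Face Transformers and building a Flamingo-style
# in-context instruction prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id; substitute the actually released Otter checkpoint.
model_id = "org/otter-checkpoint-placeholder"

# trust_remote_code lets Transformers load the custom multi-modal model class
# shipped with the checkpoint, which is how architectures outside the core
# library are typically exposed on the Hub.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Illustrative in-context instruction prompt in the interleaved style:
# (image, instruction, answer) exemplars followed by the query instruction.
# The <image> and <|endofchunk|> markers stand in for whatever special tokens
# the checkpoint's tokenizer actually defines.
prompt = (
    "<image>User: What is unusual about this image? "
    "GPT: The dog is wearing sunglasses.<|endofchunk|>"
    "<image>User: What is the man in this image doing? GPT:"
)
inputs = tokenizer(prompt, return_tensors="pt")
```

The image tensors themselves would come from the vision processor bundled with the checkpoint and be passed to the model's generate call alongside the tokenized prompt; those details are release-specific and omitted here.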
Related papers
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Generative Visual Instruction Tuning [11.727612242016871]
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model.
We produce GenLLaVA, a Generative Large Language and Visual Assistant.
Our model demonstrates visual understanding capabilities superior to LLaVA and demonstrates competitive results with native multimodal models.
arXiv Detail & Related papers (2024-06-17T07:06:58Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [43.54069813039309]
We study vision-language instruction tuning based on the pretrained BLIP-2 models.
InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets.
Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks.
arXiv Detail & Related papers (2023-05-11T00:38:10Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.