Reformulating Vision-Language Foundation Models and Datasets Towards
Universal Multimodal Assistants
- URL: http://arxiv.org/abs/2310.00653v1
- Date: Sun, 1 Oct 2023 12:35:18 GMT
- Title: Reformulating Vision-Language Foundation Models and Datasets Towards
Universal Multimodal Assistants
- Authors: Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang,
Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, Zhiyuan Liu, Hai-Tao Zheng, Maosong
Sun
- Abstract summary: The Muffin framework employs pre-trained vision-language models as providers of visual signals.
The UniMM-Chat dataset exploits the complementarity of existing datasets to generate 1.1M high-quality and diverse multimodal instructions.
- Score: 65.47222691674074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit impressive
abilities to perceive images and follow open-ended instructions. The
capabilities of MLLMs depend on two crucial factors: the model architecture
that aligns the features of visual modules with large language models, and the
multimodal instruction-tuning datasets used for human instruction following.
(i) For the model architecture, most existing models introduce an external
bridge module to connect vision encoders with language models, which requires
an additional feature-alignment pre-training stage. In this work, we discover
that compact pre-trained vision-language models can inherently serve as
"out-of-the-box" bridges between vision and language. Based on this, we
propose the Muffin framework, which directly employs pre-trained
vision-language models as providers of visual signals. (ii) For the multimodal
instruction-tuning datasets, existing methods overlook the complementary
relationship between different datasets and simply mix datasets from different
tasks. Instead, we propose the UniMM-Chat dataset, which exploits the
complementarity of datasets to generate 1.1M high-quality and diverse
multimodal instructions. It merges information describing the same image from
diverse datasets and transforms it into knowledge-intensive conversation data.
Experimental results demonstrate the effectiveness of the Muffin framework and
the UniMM-Chat dataset. Muffin achieves state-of-the-art performance on a wide
range of vision-language tasks, significantly surpassing strong models such as
LLaVA and InstructBLIP. Our model and dataset are publicly available at
https://github.com/thunlp/muffin.
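To make the architectural idea in (i) concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: a compact pre-trained vision-language model is used directly as the provider of visual signals, and its fused outputs are projected once into the LLM's embedding space rather than being aligned by a separately pre-trained bridge module. The encoder interface, query-token count, dimensions, and HuggingFace-style LLM API assumed here are illustrative.

```python
import torch
import torch.nn as nn

class MuffinStyleModel(nn.Module):
    """Sketch of the Muffin-style design: a pre-trained vision-language model
    (VLM) acts directly as the provider of visual signals for an LLM, with a
    single projection instead of a separately pre-trained bridge module."""

    def __init__(self, vlm_encoder, llm, num_query_tokens=64,
                 vlm_dim=1024, llm_dim=4096):
        super().__init__()
        self.vlm_encoder = vlm_encoder   # compact pre-trained VLM (interface assumed)
        self.llm = llm                   # pre-trained LLM (HuggingFace-style API assumed)
        # learnable query tokens fused with the image inside the VLM
        # (count and dimensions are illustrative assumptions)
        self.query_tokens = nn.Parameter(torch.randn(1, num_query_tokens, vlm_dim))
        self.proj = nn.Linear(vlm_dim, llm_dim)  # map VLM outputs into LLM embedding space

    def forward(self, image, text_input_ids):
        queries = self.query_tokens.expand(image.size(0), -1, -1)
        # the VLM already aligns vision and language internally, so its fused
        # outputs can be consumed by the LLM "out of the box"
        visual_signals = self.vlm_encoder(image, queries)   # (B, Q, vlm_dim)
        visual_embeds = self.proj(visual_signals)           # (B, Q, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(text_input_ids)
        inputs_embeds = torch.cat([visual_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```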
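Similarly, a hypothetical sketch of the dataset-construction idea in (ii): annotations that describe the same image are gathered across source datasets and handed to a text-only LLM to be rewritten as a knowledge-intensive multi-turn conversation. The dataset names, record fields, and prompt wording below are assumptions for illustration, not the authors' exact pipeline.

```python
from collections import defaultdict

def merge_annotations(datasets):
    """Group all annotations that refer to the same image across source datasets."""
    merged = defaultdict(list)
    for name, records in datasets.items():
        for rec in records:               # each rec: {"image_id": ..., "text": ...}
            merged[rec["image_id"]].append(f"[{name}] {rec['text']}")
    return merged

def build_dialogue_prompt(annotations):
    """Assemble an instruction asking a text-only LLM to turn the merged
    annotations into a knowledge-intensive multi-turn conversation."""
    context = "\n".join(annotations)
    return (
        "The following annotations all describe the same image:\n"
        f"{context}\n"
        "Write a multi-turn conversation between a user and an assistant that "
        "requires combining these facts about the image."
    )

# Usage sketch with toy records (dataset names and fields are illustrative).
datasets = {
    "captions": [{"image_id": 1, "text": "A man rides a horse on a beach."}],
    "vqa":      [{"image_id": 1, "text": "Q: What animal is shown? A: A horse."}],
}
for image_id, annos in merge_annotations(datasets).items():
    prompt = build_dialogue_prompt(annos)  # would be sent to an LLM to generate the dialogue
```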
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, covering 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv Detail & Related papers (2023-04-27T13:27:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.