InfMLLM: A Unified Framework for Visual-Language Tasks
- URL: http://arxiv.org/abs/2311.06791v2
- Date: Wed, 6 Dec 2023 11:06:06 GMT
- Title: InfMLLM: A Unified Framework for Visual-Language Tasks
- Authors: Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi
- Abstract summary: Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
- Score: 44.29407348046122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have proven their remarkable versatility in
handling a comprehensive range of language-centric applications. To expand
LLMs' capabilities to a broader spectrum of modal inputs, multimodal large
language models (MLLMs) have attracted growing interest. This work delves into
enabling LLMs to tackle more vision-language-related tasks, particularly image
captioning, visual question answering (VQA), and visual grounding. To this end,
we implemented a three-stage training scheme: starting with lightweight
alignment pretraining, then moderate-weight multitask hybrid training, and
finally, LLM fine-tuning to improve instruction following capability.
Throughout the training process, GPU memory requirements gradually increase.
To effectively manage the number of visual embeddings passed to the
LLM while preserving their positional information, we introduce a
straightforward visual adapter module dubbed pool-adapter. Our experiments
demonstrate that preserving the positional information of visual embeddings
through the pool-adapter is particularly beneficial for tasks like visual
grounding. We name our proposed approach InfMLLM and have evaluated it
extensively on various benchmark datasets. Our results demonstrate that InfMLLM
achieves either state-of-the-art (SOTA) performance or performance comparable
to recent MLLMs. The code and model will be made open-source at:
https://github.com/mightyzau/InfMLLM.
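The abstract does not spell out the pool-adapter's internals; the following is a minimal sketch of one plausible reading, in which 2D adaptive average pooling shrinks the visual-token grid (reducing the number of embeddings passed to the LLM) while keeping the row/column ordering, and a linear layer projects the result into the LLM's embedding space. The class name, pooling choice, and dimensions are illustrative assumptions, not the released implementation.

```python
# Hypothetical pool-adapter sketch: fewer visual embeddings for the LLM,
# with the 2D (positional) layout of the patch grid preserved.
import torch
import torch.nn as nn


class PoolAdapter(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=4096, out_grid=8):
        super().__init__()
        # Adaptive average pooling keeps the spatial ordering of patches,
        # so each output token still corresponds to a fixed image region.
        self.pool = nn.AdaptiveAvgPool2d(out_grid)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_tokens):
        # vis_tokens: (B, N, C) patch embeddings from the vision encoder,
        # with N a perfect square (e.g. 24 x 24 = 576 patches).
        b, n, c = vis_tokens.shape
        side = int(n ** 0.5)
        grid = vis_tokens.transpose(1, 2).reshape(b, c, side, side)
        pooled = self.pool(grid)                    # (B, C, out, out)
        pooled = pooled.flatten(2).transpose(1, 2)  # (B, out*out, C)
        return self.proj(pooled)                    # (B, out*out, llm_dim)


if __name__ == "__main__":
    x = torch.randn(2, 576, 1024)       # 24 x 24 patches from a ViT encoder
    print(PoolAdapter()(x).shape)       # torch.Size([2, 64, 4096])
```

Pooling over the grid, rather than resampling tokens with learned queries, keeps each surviving embedding tied to an image region, which is consistent with the abstract's observation that preserving positional information helps visual grounding.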
Related papers
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
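The summary characterizes MDist as minimizing the divergence between the teacher's (l-MLLM) and student's (s-MLLM) visual-textual output distributions. Below is a hedged sketch of a standard temperature-scaled KL distillation loss over next-token distributions; the temperature, reduction, and exact loss form are assumptions rather than the paper's formulation.

```python
# Hedged sketch: KL-based distillation of a student MLLM's output
# distribution toward a teacher's on the same multimodal input.
import torch.nn.functional as F


def mdist_loss(student_logits, teacher_logits, temperature=2.0):
    # logits: (B, T, V) over the LLM vocabulary for identical image-text input.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```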
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs [40.74693126923826]
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities.
Training adapters with image-level supervision often results in significant misalignment.
We introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models.
arXiv Detail & Related papers (2024-08-21T17:58:02Z)
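SEA is summarized as a token-level alignment method that leverages vision-language pre-trained models. One plausible reading, sketched below, is an InfoNCE-style loss pulling each adapter-projected visual token toward a paired text embedding (e.g. from a CLIP text encoder); how token-text pairs are constructed is not described in the summary and is assumed here.

```python
# Hedged sketch of token-level visual-textual alignment via an InfoNCE loss;
# the token-to-text pairing is an assumption, not the paper's procedure.
import torch
import torch.nn.functional as F


def token_alignment_loss(visual_tokens, text_embeds, tau=0.07):
    # visual_tokens, text_embeds: (N, D); row i of each forms a matched pair.
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = v @ t.T / tau                           # (N, N) similarities
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)           # match token i to text i
```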
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- CLAMP: Contrastive LAnguage Model Prompt-tuning [89.96914454453791]
We show that large language models can achieve good image classification performance when adapted with contrastive prompt-tuning.
Our approach beats state-of-the-art MLLMs by 13% and slightly outperforms contrastive learning with a custom text model.
arXiv Detail & Related papers (2023-12-04T05:13:59Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.