Position-Enhanced Visual Instruction Tuning for Multimodal Large
Language Models
- URL: http://arxiv.org/abs/2308.13437v2
- Date: Thu, 14 Sep 2023 15:00:06 GMT
- Title: Position-Enhanced Visual Instruction Tuning for Multimodal Large
Language Models
- Authors: Chi Chen, Ruoyu Qin, Fuwen Luo, Xiaoyue Mi, Peng Li, Maosong Sun, Yang
Liu
- Abstract summary: We propose Position-enhanced Visual Instruction Tuning (PVIT), which extends the functionality of Multimodal Large Language Models (MLLMs) with an additional region-level vision encoder.
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis demonstrating the superiority of the proposed model.
- Score: 50.07056960586183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Multimodal Large Language Models (MLLMs) that enable Large Language
Models (LLMs) to interpret images through visual instruction tuning have
achieved significant success. However, existing visual instruction tuning
methods only utilize image-language instruction data to align the language and
image modalities, lacking a more fine-grained cross-modal alignment. In this
paper, we propose Position-enhanced Visual Instruction Tuning (PVIT), which
extends the functionality of MLLMs by integrating an additional region-level
vision encoder. This integration promotes a more detailed comprehension of
images for the MLLM. In addition, to efficiently achieve a fine-grained
alignment between the vision modules and the LLM, we design multiple data
generation strategies to construct an image-region-language instruction
dataset. Finally, we present both quantitative experiments and qualitative
analysis that demonstrate the superiority of the proposed model. Code and data
will be released at https://github.com/PVIT-official/PVIT.
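The abstract describes integrating a region-level vision encoder so that region features, image-level features, and instruction text are jointly fed to the LLM. The snippet below is a minimal PyTorch sketch of that idea only, not the released PVIT code (see the repository link above): the RoIAlign-based pooling, module names, and dimensions are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.ops import roi_align


class RegionEnhancedFusion(nn.Module):
    """Hypothetical fusion module: projects image-level and region-level visual
    features into the LLM embedding space and concatenates them with the
    instruction tokens (illustration only, not the official PVIT implementation)."""

    def __init__(self, vision_dim=1024, region_dim=256, llm_dim=4096):
        super().__init__()
        self.image_proj = nn.Linear(vision_dim, llm_dim)   # image-level patch features -> LLM space
        self.region_proj = nn.Linear(region_dim, llm_dim)  # pooled region features -> LLM space

    def forward(self, image_tokens, region_feature_map, boxes, text_embeds):
        # image_tokens:       (1, N, vision_dim) patch features from the image-level encoder
        # region_feature_map: (1, C, H, W)       feature map from the region-level encoder
        # boxes:              [(K, 4)]           user-specified regions as (x1, y1, x2, y2)
        # text_embeds:        (1, T, llm_dim)    embedded instruction tokens
        pooled = roi_align(region_feature_map, boxes, output_size=1)  # (K, C, 1, 1)
        region_tokens = self.region_proj(pooled.flatten(1))           # (K, llm_dim)
        img_tokens = self.image_proj(image_tokens)[0]                 # (N, llm_dim)
        # Prepend visual tokens (image-level, then region-level) to the text sequence.
        return torch.cat([img_tokens, region_tokens, text_embeds[0]], dim=0).unsqueeze(0)


# Example usage with random tensors (batch size 1, two regions of interest).
fusion = RegionEnhancedFusion()
image_tokens = torch.randn(1, 256, 1024)
region_map = torch.randn(1, 256, 24, 24)
boxes = [torch.tensor([[2.0, 2.0, 10.0, 10.0], [5.0, 5.0, 20.0, 20.0]])]
text_embeds = torch.randn(1, 32, 4096)
fused = fusion(image_tokens, region_map, boxes, text_embeds)  # shape (1, 256 + 2 + 32, 4096)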
Related papers
- EMMA: Efficient Visual Alignment in Multi-Modal LLMs [56.03417732498859]
EMMA is a lightweight cross-modality module designed to efficiently fuse visual and textual encodings.
EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.
arXiv Detail & Related papers (2024-10-02T23:00:31Z)
- MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z)
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs [40.74693126923826]
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities.
Training adapters with image-level supervision often results in significant misalignment.
We introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models.
arXiv Detail & Related papers (2024-08-21T17:58:02Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv Detail & Related papers (2023-11-12T09:58:16Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)