Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
- URL: http://arxiv.org/abs/2412.03565v1
- Date: Wed, 04 Dec 2024 18:58:10 GMT
- Title: Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
- Authors: Wujian Peng, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang,
- Abstract summary: Inst-IT is a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning.
Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm.
- Score: 125.79428219851289
- License:
- Abstract: Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding that requires a more nuanced comprehension and alignment. Instance-level understanding is crucial, as it focuses on the specific elements that we are most interested in. Excitingly, existing works find that the state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we introduce an automated annotation pipeline assisted by GPT-4o to extract instance-level information from images and videos through explicit visual prompting for instance guidance. Building upon this pipeline, we proposed Inst-IT, a solution to enhance LMMs in Instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm to effectively enhance spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, with the boost of Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our dataset not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.
Related papers
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories.
Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance.
We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - MetaMorph: Multimodal Understanding and Generation via Instruction Tuning [57.35160715164359]
Visual-Predictive Instruction Tuning (VPiT) is a simple and effective extension to visual instruction tuning.
VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data.
We train our MetaMorph model and achieve competitive performance on both visual understanding and generation.
arXiv Detail & Related papers (2024-12-18T18:58:50Z) - In-Context Learning Improves Compositional Understanding of Vision-Language Models [2.762909189433944]
compositional image understanding remains a rather difficult task due to the object bias present in training data.
We compare contrastive models with generative ones and analyze their differences in architecture, pre-training data, and training tasks and losses.
Our proposed approach outperforms baseline models across multiple compositional understanding datasets.
arXiv Detail & Related papers (2024-07-22T09:03:29Z) - InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding [12.082379948480257]
This paper proposes InsightSee, a multi-agent framework to enhance vision-language models' capabilities in handling complex visual understanding scenarios.
The framework comprises a description agent, two reasoning agents, and a decision agent, which are integrated to refine the process of visual information interpretation.
The proposed framework outperforms state-of-the-art algorithms in 6 out of 9 benchmark tests, with a substantial advancement in multimodal understanding.
arXiv Detail & Related papers (2024-05-31T13:56:55Z) - Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models [42.182009352159]
We present a new efficient LLVM, Mamba-based traversal of rationales (Meteor)
To embed lengthy rationales containing abundant information, we employ the Mamba architecture, capable of processing sequential data with linear time complexity.
Subsequently, the backbone multimodal language model (MLM) is trained to generate answers with the aid of rationale.
arXiv Detail & Related papers (2024-05-24T14:04:03Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z) - What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning [111.01953096869947]
Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs)
We develop a systematic approach to automatically create high-quality complex visual reasoning instructions.
Experimental results consistently demonstrate the enhanced performance of all compared MLLMs.
arXiv Detail & Related papers (2023-11-02T15:36:12Z) - What Makes for Good Visual Tokenizers for Large Language Models? [26.488269091290597]
We investigate proper pre-training methods to build good visual tokenizers, making Large Language Models (LLMs) powerful Multimodal Large Language Models (MLLMs)
We discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO)
We obtain a new MLLM equipped with a tailored Good Visual Tokenizer (GVT), which exhibits strong visual comprehension capability at multiple scales.
arXiv Detail & Related papers (2023-05-20T16:11:26Z) - Dynamic Prompting: A Unified Framework for Prompt Tuning [33.175097465669374]
We present a unified dynamic prompt (DP) tuning strategy that dynamically determines different factors of prompts based on specific tasks and instances.
Experimental results underscore the significant performance improvement achieved by dynamic prompt tuning across a wide range of tasks.
We establish the universal applicability of our approach under full-data, few-shot, and multitask scenarios.
arXiv Detail & Related papers (2023-03-06T06:04:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.