Visual Instruction Tuning towards General-Purpose Multimodal Model: A
Survey
- URL: http://arxiv.org/abs/2312.16602v1
- Date: Wed, 27 Dec 2023 14:54:37 GMT
- Title: Visual Instruction Tuning towards General-Purpose Multimodal Model: A
Survey
- Authors: Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu and Shijian Lu
- Abstract summary: Traditional computer vision generally solves each task independently with a dedicated model whose task instruction is implicitly designed into the model architecture.
Visual Instruction Tuning (VIT), which finetunes a large vision model with language as task instructions, has recently been studied intensively.
This work aims to provide a systematic review of visual instruction tuning, covering (1) the background that presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT that introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; and (3) the commonly used datasets in visual instruction tuning and evaluation.
- Score: 59.95153883166705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional computer vision generally solves each task independently
with a dedicated model whose task instruction is implicitly designed into the
model architecture, giving rise to two limitations: (1) it leads to task-specific
models,
which require multiple models for different tasks and restrict the potential
synergies from diverse tasks; (2) it leads to a pre-defined and fixed model
interface that has limited interactivity and adaptability in following the user's
task instructions. To address these limitations, Visual Instruction Tuning (VIT),
which finetunes a large vision model with language as task instructions, has been
studied intensively in recent years; it aims to learn, from a wide range of vision
tasks described by language instructions, a general-purpose multimodal model that
can follow arbitrary instructions and thus solve arbitrary tasks specified
by the user. This work aims to provide a systematic review of visual
instruction tuning, covering (1) the background that presents computer vision
task paradigms and the development of VIT; (2) the foundations of VIT that
introduce commonly used network architectures, visual instruction tuning
frameworks and objectives, and evaluation setups and tasks; (3) the commonly
used datasets in visual instruction tuning and evaluation; (4) the review of
existing VIT methods that categorizes them with a taxonomy according to both
the studied vision task and the method design, and highlights their major
contributions, strengths, and shortcomings; (5) the comparison and
discussion of VIT methods over various instruction-following benchmarks; (6)
several challenges, open directions and possible future works in visual
instruction tuning research.
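As a rough illustration of the training recipe described above (a minimal sketch, not code from any surveyed method), the snippet below shows how a visual instruction sample and a single tuning step are commonly structured; the ToyVLM module, its sizes, and the tokenization are hypothetical placeholders, and real VIT methods use a pretrained vision encoder plus a large language model in place of the toy layers.

# A minimal sketch of visual instruction tuning (illustration only, not the code
# of any surveyed method). A sample pairs an image with a language instruction
# and a target response; the model is tuned with a token-prediction loss that is
# masked so only response tokens contribute. All names and sizes are toy choices.
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass
class VisualInstructionSample:
    image: torch.Tensor   # (3, H, W) pixel tensor
    instruction: str      # e.g. "Describe the image in one sentence."
    response: str         # target answer the model should generate

class ToyVLM(nn.Module):
    # Stand-in for a vision-language model: a linear "visual encoder" whose
    # output is prepended to the text embeddings before a language-model head.
    # Real VIT methods place a pretrained vision encoder and an LLM here.
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.visual_proj = nn.Linear(3 * 32 * 32, dim)
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        vis = self.visual_proj(images.flatten(1)).unsqueeze(1)  # (B, 1, dim)
        txt = self.token_emb(input_ids)                         # (B, T, dim)
        return self.lm_head(torch.cat([vis, txt], dim=1))       # (B, 1+T, vocab)

def instruction_tuning_step(model, optimizer, images, input_ids, labels):
    # One update: each kept logit position predicts the following text token
    # (the visual slot predicts the first text token); labels use -100 to mask
    # instruction tokens so the loss is computed on the response only.
    logits = model(images, input_ids)[:, :-1, :]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice, the instruction and response of each VisualInstructionSample are tokenized by the language model's tokenizer to form input_ids and labels, and only the loss masking and the image-plus-instruction input format carry over to real systems.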
Related papers
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs).
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning [53.93074108238167]
We construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date.
We propose a two-stage instruction tuning framework, in which VLMs are finetuned on Vision-Flan and further tuned on GPT-4 synthesized data.
We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework.
arXiv Detail & Related papers (2024-02-18T19:38:44Z)
- Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model; a toy sketch of this prompt format appears after this list.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- Prompt Tuning with Soft Context Sharing for Vision-Language Models [42.61889428498378]
We propose SoftCPT, a novel method to tune pre-trained vision-language models on multiple target few-shot tasks jointly.
We show that SoftCPT significantly outperforms single-task prompt tuning methods.
arXiv Detail & Related papers (2022-08-29T10:19:10Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
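As referenced in the MiniGPT-v2 entry above, the toy snippet below illustrates the task-identifier idea; the tag strings, placeholder token, and helper function are assumptions for illustration rather than the released prompt format.

# Toy illustration of task identifiers (assumed format, not copied from the
# MiniGPT-v2 release): each prompt is prefixed with a tag naming its task so a
# single model can be steered between tasks during training and inference.
TASK_TAGS = {
    "vqa": "[vqa]",              # visual question answering
    "grounding": "[grounding]",  # grounded / region-aware description
    "caption": "[caption]",      # image captioning
}

def build_prompt(task: str, instruction: str, image_placeholder: str = "<ImageHere>") -> str:
    # Prepend the task tag so the same model can disambiguate task intent.
    return f"{TASK_TAGS[task]} {image_placeholder} {instruction}"

print(build_prompt("vqa", "What color is the car?"))
# -> "[vqa] <ImageHere> What color is the car?"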
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.