Vision-Language Models for Vision Tasks: A Survey
- URL: http://arxiv.org/abs/2304.00685v2
- Date: Fri, 16 Feb 2024 10:28:12 GMT
- Title: Vision-Language Models for Vision Tasks: A Survey
- Authors: Jingyi Zhang, Jiaxing Huang, Sheng Jin and Shijian Lu
- Abstract summary: Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs.
This paper provides a systematic review of visual language models for various visual recognition tasks.
- Score: 62.543250338410836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single model. This paper provides a systematic review of VLMs for various visual recognition tasks, covering: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets for VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey.
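To ground the zero-shot prediction setting mentioned in the abstract, the snippet below sketches zero-shot image classification with CLIP, one representative VLM reviewed in the survey, using the Hugging Face transformers implementation; the checkpoint name, image path, and class list are illustrative assumptions rather than anything prescribed by the paper.

```python
# Minimal sketch of zero-shot classification with a pre-trained VLM (CLIP).
# Checkpoint, image path, and class names below are illustrative choices.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in class_names]  # prompt-engineered labels

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, normalized into per-class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```

The same prompt-then-compare pattern extends to other recognition tasks: class names are turned into text prompts, and the image is assigned to whichever prompt has the highest image-text similarity, with no task-specific training.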
Related papers
- VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models [19.291697178628546]
Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks.
In this work, we propose an eye examination process to investigate how a VLM perceives images.
arXiv Detail & Related papers (2024-09-23T07:15:29Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs [83.24033574914425]
We present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving.
Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information.
Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.
arXiv Detail & Related papers (2024-06-20T17:54:03Z)
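As an illustration of the decoupled perception/reasoning design described in the Prism summary above, here is a minimal sketch of a two-stage pipeline; the `caption_model` and `llm` callables and the prompt wording are hypothetical placeholders, not Prism's actual API.

```python
# Minimal sketch of a Prism-style decoupled pipeline (illustrative only).
# `caption_model` stands for any VLM that can describe an image in text;
# `llm` stands for any text-only language model. Both names are assumptions.

def perception_stage(caption_model, image, question: str) -> str:
    """Use a VLM to extract and articulate the relevant visual information as text."""
    prompt = f"Describe the visual details needed to answer: {question}"
    return caption_model(image=image, prompt=prompt)

def reasoning_stage(llm, question: str, visual_description: str) -> str:
    """Let a text-only LLM formulate the answer from the extracted description."""
    prompt = (
        f"Visual description: {visual_description}\n"
        f"Question: {question}\n"
        "Answer using only the description above."
    )
    return llm(prompt)

def prism_style_vqa(caption_model, llm, image, question: str) -> str:
    description = perception_stage(caption_model, image, question)
    return reasoning_stage(llm, question, description)
```

Keeping the stages separate is what allows a lightweight VLM to handle perception while a stronger text-only LLM handles reasoning, which appears to be the basis of the cost-effectiveness claim above.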
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
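To make the notion of separated attention spaces in the DiMBERT summary more concrete, below is a generic PyTorch sketch of a block that contextualizes vision tokens and language tokens with separate attention modules; it illustrates the general idea under assumed dimensions and is not DiMBERT's actual architecture.

```python
# Generic illustration of separated attention spaces for vision and language
# tokens (NOT DiMBERT's actual design; dimensions and layout are assumptions).
import torch
import torch.nn as nn

class DisentangledAttentionBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        # Separate attention modules so each modality is contextualized
        # in its own attention space.
        self.vision_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.language_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens: torch.Tensor, language_tokens: torch.Tensor):
        v, _ = self.vision_attn(vision_tokens, vision_tokens, vision_tokens)
        l, _ = self.language_attn(language_tokens, language_tokens, language_tokens)
        return v, l

# Toy usage: a batch of 2 samples with 50 visual tokens and 16 text tokens.
block = DisentangledAttentionBlock()
v_out, l_out = block(torch.randn(2, 50, 768), torch.randn(2, 16, 768))
```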
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.