Related papers: Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

URL: http://arxiv.org/abs/2402.07865v2
Date: Thu, 30 May 2024 13:08:48 GMT
Title: Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models
Authors: Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, Dorsa Sadigh,
Abstract summary: Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
Score: 73.40350756742231
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning; adoption that has fueled a wealth of new models such as LLaVa, InstructBLIP, and PaLI-3. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored, making it challenging to understand what factors account for model performance $-$ a challenge further complicated by the lack of objective, consistent evaluations. To address these gaps, we first compile a suite of standardized evaluations spanning visual question answering, object localization, and challenge sets that probe properties such as hallucination; evaluations that provide fine-grained insight VLM capabilities. Second, we rigorously investigate VLMs along key design axes, including pretrained visual representations and training from base vs. instruct-tuned language models, amongst others. We couple our analysis with three resource contributions: (1) a unified framework for evaluating VLMs, (2) optimized, flexible training code, and (3) checkpoints for all models, including a family of VLMs at the 7-13B scale that strictly outperform InstructBLIP and LLaVa v1.5, the state-of-the-art in open VLMs.

Related papers

Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding [39.64540328712615]
Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications.<n>Existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations.<n>We introduce the Point-It-Out benchmark, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding.
arXiv Detail & Related papers (2025-09-30T05:05:54Z)
Hidden in plain sight: VLMs overlook their visual representations [48.83628674170634]
We compare vision language models (VLMs) to their visual encoders to understand their ability to integrate across these modalities.<n>We find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance.
arXiv Detail & Related papers (2025-06-09T17:59:54Z)
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs.<n>ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates.<n> Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 3.49 million questions and 3.32 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation [38.20492321295552]
Vision-Language Model (VLM) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks. Despite they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated.
arXiv Detail & Related papers (2025-04-13T08:28:13Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
iVISPAR -- An Interactive Visual-Spatial Reasoning Benchmark for VLMs [4.381263829108405]
Vision-Language Models (VLMs) are known to struggle with spatial reasoning and visual alignment. We introduce iVISPAR, an interactive multi-modal benchmark designed to evaluate the spatial reasoning capabilities of VLMs acting as agents.
arXiv Detail & Related papers (2025-02-05T14:29:01Z)
VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM) VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems. These models have been shown to be highly capable, but also lacking some basic visual understanding skills. This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs) One understudied capability inVLMs is visual spatial planning. Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs [83.24033574914425]
We present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks.
arXiv Detail & Related papers (2024-06-20T17:54:03Z)
Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations [1.709620026135923]
Large language models (LLM) have become an interesting option for supporting generative tasks related to visualization. This paper copes with the problem of modeling the evaluation of a generated visualization through an LLM. We propose a theoretical evaluation stack, EvaLLM, that decomposes the evaluation effort in its atomic components.
arXiv Detail & Related papers (2024-02-03T14:28:55Z)
VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons. We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.