PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
- URL: http://arxiv.org/abs/2402.08657v1
- Date: Tue, 13 Feb 2024 18:39:18 GMT
- Title: PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
- Authors: Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano
- Abstract summary: Vision-Language Models (VLMs) have shown immense potential by integrating large language models with vision systems.
These models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions.
We introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing a minimal set of parameters that are slid inside the frozen VLM.
Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads.
- Score: 55.8550939439138
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown
immense potential by integrating large language models with vision systems.
Nevertheless, these models face challenges in the fundamental computer vision
task of object localisation, due to their training on multimodal data
containing mostly captions without explicit spatial grounding. While it is
possible to construct custom, supervised training pipelines with bounding box
annotations that integrate with VLMs, these result in specialized and
hard-to-scale models. In this paper, we aim to explore the limits of
caption-based VLMs and instead propose to tackle the challenge in a simpler
manner by i) keeping the weights of a caption-based VLM frozen and ii) not
using any supervised detection data. To this end, we introduce an
input-agnostic Positional Insert (PIN), a learnable spatial prompt, containing
a minimal set of parameters that are slid inside the frozen VLM, unlocking
object localisation capabilities. Our PIN module is trained with a simple
next-token prediction task on synthetic data without requiring the introduction
of new output heads. Our experiments demonstrate strong zero-shot localisation
performances on a variety of images, including Pascal VOC, COCO, LVIS, and
diverse images like paintings or cartoons.
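The two ideas in the abstract can be illustrated with a minimal sketch. The first is an input-agnostic learnable insert that is added to a frozen model's visual features. The second is localisation expressed as ordinary text, so that next-token prediction suffices and no new output head is needed. This is not the authors' implementation: the abstract does not specify the insertion point, the initialisation, or the coordinate format, so every function name, the `bins` parameter and the `[x0, y0, x1, y1]` text format below are assumptions made for illustration only.

```python
import random


def make_positional_insert(num_patches, dim, seed=0):
    """Create one small vector per patch position (hypothetical initialisation).

    In a real setup these values would be the only trainable parameters;
    the VLM weights would stay frozen.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_patches)]


def apply_insert(frozen_features, insert):
    """Add the same insert to any image's patch features (input-agnostic)."""
    return [
        [f + p for f, p in zip(feat_row, pin_row)]
        for feat_row, pin_row in zip(frozen_features, insert)
    ]


def box_to_text(box, width, height, bins=224):
    """Serialise a pixel box (x0, y0, x1, y1) as a text target.

    Coordinates are normalised by image size and quantised into `bins`
    integer values, so the target is a plain string a language model can
    be trained to emit token by token (assumed format, not from the paper).
    """
    x0, y0, x1, y1 = box
    scaled = (x0 / width, y0 / height, x1 / width, y1 / height)
    return "[" + ", ".join(str(min(bins - 1, int(v * bins))) for v in scaled) + "]"


def text_to_box(text, width, height, bins=224):
    """Parse a generated string back into approximate pixel coordinates."""
    x0, y0, x1, y1 = (int(tok) for tok in text.strip("[]").split(","))
    return (x0 / bins * width, y0 / bins * height,
            x1 / bins * width, y1 / bins * height)
```

Because the insert does not depend on the image, the same tensor is reused for every input, and the difference between the modified and original features is always the insert itself. Parsing the generated string with `text_to_box` recovers a box that can then be compared against ground truth.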
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Large Language Models Understand Layout [6.732578061359833]
Large language models (LLMs) demonstrate extraordinary abilities in a wide range of natural language processing (NLP) tasks.
We show that, beyond text understanding capability, LLMs are capable of processing text layouts denoted by spatial markers.
We show that layout understanding ability is beneficial for building efficient visual question-answering (VQA) systems.
arXiv Detail & Related papers (2024-07-08T09:03:12Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Q-GroundCAM: Quantifying Grounding in Vision Language Models via GradCAM [3.2688425993442696]
Many probing studies have revealed that even the best-performing Vision and Language Models (VLMs) struggle to capture aspects of compositional scene understanding.
Recent VLM advancements include scaling up both model and dataset sizes, additional training objectives and levels of supervision.
This paper introduces a novel suite of quantitative metrics that utilize GradCAM activations to rigorously evaluate the grounding capabilities of pre-trained VLMs.
arXiv Detail & Related papers (2024-04-29T22:06:17Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models [30.20915403608803]
Griffon is a purely LVLM-based baseline, built on a novel Language-prompted Localization Dataset for large vision language models.
It is trained end-to-end through a well-designed pipeline.
It achieves state-of-the-art performance on the fine-grained RefCOCO series and Flickr30K Entities.
arXiv Detail & Related papers (2023-11-24T15:35:07Z)
- Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs [79.64891686479213]
We show that it is possible to improve vision and language models (VLMs) when learning from scene graphs (SGs).
For the visual side, we incorporate a special "SG Component" in the image transformer trained to predict SG information, while for the textual side, we utilize SGs to generate fine-grained captions.
Our method improves the performance of several popular VLMs on multiple datasets with only a mild degradation in zero-shot (ZS) capabilities.
arXiv Detail & Related papers (2023-05-10T17:52:26Z)