What does CLIP know about a red circle? Visual prompt engineering for VLMs
- URL: http://arxiv.org/abs/2304.06712v2
- Date: Fri, 18 Aug 2023 05:49:47 GMT
- Title: What does CLIP know about a red circle? Visual prompt engineering for VLMs
- Authors: Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi
- Abstract summary: We explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text.
We show the power of this simple approach by achieving state-of-the-art in zero-shot referring expressions comprehension and strong performance in keypoint localization tasks.
- Score: 116.8806079598019
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale Vision-Language Models, such as CLIP, learn powerful image-text
representations that have found numerous applications, from zero-shot
classification to text-to-image generation. Despite that, their capabilities
for solving novel discriminative tasks via prompting fall behind those of large
language models, such as GPT-3. Here we explore the idea of visual prompt
engineering for solving computer vision tasks beyond classification by editing
in image space instead of text. In particular, we discover an emergent ability
of CLIP, where, by simply drawing a red circle around an object, we can direct
the model's attention to that region, while also maintaining global
information. We show the power of this simple approach by achieving
state-of-the-art in zero-shot referring expressions comprehension and strong
performance in keypoint localization tasks. Finally, we draw attention to some
potential ethical concerns of large language-vision models.
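To make the visual prompting idea concrete, here is a minimal sketch (not the authors' exact pipeline) of red-circle prompting with an off-the-shelf CLIP model from Hugging Face Transformers: candidate boxes are assumed to come from an external proposal source, a red ellipse is drawn around each one, and CLIP scores the marked copies of the image against the referring expression. The marker color, thickness, padding, and model checkpoint are illustrative choices, not the paper's tuned settings.

```python
# Sketch of red-circle visual prompting for referring expression comprehension.
# Assumptions: candidate boxes come from some external proposal source; marker
# settings and checkpoint are illustrative, not the paper's tuned configuration.
import torch
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mark_with_red_circle(image, box, pad=6, width=4):
    """Return a copy of `image` with a red ellipse drawn around `box` = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    # Pad the box slightly so the ellipse encloses rather than clips the object.
    draw.ellipse((x0 - pad, y0 - pad, x1 + pad, y1 + pad),
                 outline=(255, 0, 0), width=width)
    return marked

def pick_box_for_expression(image, boxes, expression):
    """Score one marked copy of the image per candidate box; return the best box."""
    marked_images = [mark_with_red_circle(image, b) for b in boxes]
    inputs = processor(text=[expression], images=marked_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits_per_image = model(**inputs).logits_per_image  # shape (num_boxes, 1)
    return boxes[int(logits_per_image.squeeze(1).argmax())]

# Example usage with a hypothetical file and boxes:
# image = Image.open("street.jpg")
# boxes = [(30, 40, 120, 160), (200, 50, 310, 190)]
# print(pick_box_for_expression(image, boxes, "the person in the red jacket"))
```

Because the circle is drawn onto the full image rather than cropping it, the model keeps the global scene context while its attention is nudged toward the marked region, which is the emergent behavior the paper exploits.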
Related papers
- ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization [0.0]
We propose a two-stage training method to enhance visual performance and use contrastive learning to mine challenging samples.
We validate the effectiveness of the proposed strategy on several large-scale visual geo-localization datasets.
arXiv Detail & Related papers (2024-06-04T02:28:51Z)
- Re-Thinking Inverse Graphics With Large Language Models [51.333105116400205]
Inverse graphics -- inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics.
We propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM.
We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training.
arXiv Detail & Related papers (2024-04-23T16:59:02Z)
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts [38.59120110371588]
We introduce a novel multimodal model capable of decoding arbitrary visual prompts.
This allows users to intuitively mark images and interact with the model using natural cues such as a "red bounding box" or a "pointed arrow".
Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings.
arXiv Detail & Related papers (2023-12-01T18:59:56Z)
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning.
Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- APPLeNet: Visual Attention Parameterized Prompt Learning for Few-Shot Remote Sensing Image Generalization using CLIP [12.73827827842155]
We propose a novel image-conditioned prompt learning strategy called the Visual Attention conditioned Prompts Learning Network (APPLeNet).
APPLeNet emphasizes the importance of multi-scale feature learning in RS scene classification and disentangles visual style and content primitives for domain generalization tasks.
Our results consistently outperform the relevant literature and code is available at https://github.com/mainaksingha01/APPLeNet.
arXiv Detail & Related papers (2023-04-12T17:20:37Z)
- Z-LaVI: Zero-Shot Language Solver Fueled by Visual Imagination [57.49336064527538]
We develop a novel approach, Z-LaVI, to endow language models with visual imagination capabilities.
We leverage two complementary types of "imaginations": (i) recalling existing images through retrieval and (ii) synthesizing nonexistent images via text-to-image generation.
Jointly exploiting the language inputs and the imagination, a pretrained vision-language model eventually composes a zero-shot solution to the original language tasks.
arXiv Detail & Related papers (2022-10-21T21:33:10Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme (a rough sketch of this dual-encoder contrastive objective appears below).
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
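The dual-encoder alignment described in the entry above can be illustrated with a short, hedged sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of matched image-text pairs. This is not the paper's implementation: the temperature value and embedding dimensions are placeholders, and the two encoders are assumed to exist elsewhere.

```python
# Sketch of a symmetric image-text contrastive loss (InfoNCE style), assuming
# two encoders elsewhere produce one embedding per image and per caption.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds, text_embeds: (batch, dim); row i of each comes from the same pair."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))                 # matching pair sits on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)            # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Example with random embeddings standing in for encoder outputs:
# loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

In practice the temperature is often a learned parameter and the loss is computed over very large batches, which is part of how scale can compensate for noisy image-text pairs.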
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.