Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts
- URL: http://arxiv.org/abs/2307.11661v2
- Date: Tue, 8 Aug 2023 13:44:12 GMT
- Title: Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts
- Authors: Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin
McGuinness, Noel E. O'Connor
- Abstract summary: GPT-4 can be used to generate text that is visually descriptive.
We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets.
- Score: 13.486599520658919
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have
revolutionized visual representation learning by providing good performance on
downstream datasets. VLMs are 0-shot adapted to a downstream dataset by
designing prompts that are relevant to the dataset. Such prompt engineering
makes use of domain expertise and a validation dataset. Meanwhile, recent
generative pretrained models like GPT-4 can be used as advanced internet
search tools and can be prompted to provide visual information in any desired
structure. In this work, we show that GPT-4 can be
used to generate text that is visually descriptive and how this can be used to
adapt CLIP to downstream tasks. We show considerable improvements in 0-shot
transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD
(~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt.
We also design a simple few-shot adapter that learns to choose the best
possible sentences to construct generalizable classifiers that outperform the
recently proposed CoCoOP by ~2% on average and by over 4% on 4 specialized
fine-grained datasets. The code, prompts, and auxiliary text dataset are
available at https://github.com/mayug/VDT-Adapter.
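The recipe described in the abstract can be illustrated with a short sketch: represent each class by the average CLIP text embedding of several GPT-4-generated visual descriptions, then classify images by cosine similarity against these averaged embeddings. This is a minimal sketch rather than the authors' released implementation (see the repository above); the model checkpoint, class names, and description sentences are assumptions made for illustration.

```python
# Minimal sketch (not the authors' released code) of zero-shot CLIP classification
# where each class is represented by the averaged text embedding of several
# visually descriptive sentences. In practice the sentences would be generated by
# prompting GPT-4 (e.g. "Describe what an aerial photo of <class> looks like");
# the checkpoint, class names, and descriptions below are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical GPT-4 outputs: a few visual descriptions per class.
descriptions = {
    "forest": [
        "a satellite photo of dense green tree cover",
        "an aerial view of a forest with a closed canopy",
    ],
    "river": [
        "a satellite photo of a winding blue waterway",
        "an aerial view of a river cutting through farmland",
    ],
}

# Build one classifier weight per class by averaging its description embeddings.
class_names, class_weights = [], []
with torch.no_grad():
    for name, sentences in descriptions.items():
        text_inputs = processor(text=sentences, return_tensors="pt", padding=True)
        feats = model.get_text_features(**text_inputs)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        class_names.append(name)
        class_weights.append(feats.mean(dim=0))
text_weights = torch.stack(class_weights)
text_weights = text_weights / text_weights.norm(dim=-1, keepdim=True)

# Classify an image by cosine similarity against the averaged description embeddings.
image = Image.open("example.jpg")  # placeholder image path
with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    image_feats = model.get_image_features(**image_inputs)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feats @ text_weights.T).softmax(dim=-1)
print(dict(zip(class_names, probs.squeeze(0).tolist())))
```

The paper's few-shot adapter additionally learns to choose the best of such sentences per class; the sketch above simply averages them uniformly.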
Related papers
- Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning [56.795078085234195]
LLM pruning approaches universally rely on the C4 dataset as the calibration data for calculating pruning scores.
In this study, we evaluate the choice of calibration data on LLM pruning, across a wide range of datasets.
Our results also uncover several subtle and often unexpected findings.
arXiv Detail & Related papers (2024-10-09T22:00:19Z)
- Prompt4Vis: Prompting Large Language Models with Example Mining and Schema Filtering for Tabular Data Visualization [13.425454489560376]
We introduce Prompt4Vis, a framework for generating data visualization queries from natural language.
In-context learning is introduced into the text-to-vis task for generating data visualization queries.
Prompt4Vis surpasses the state-of-the-art (SOTA) RGVisNet by approximately 35.9% and 71.3% on dev and test sets, respectively.
arXiv Detail & Related papers (2024-01-29T10:23:47Z)
- COCO is "ALL" You Need for Visual Instruction Fine-tuning [39.438410070172125]
Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' outputs with users' intentions.
Recent studies propose to construct visual IFT datasets through a multifaceted approach.
We establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions.
arXiv Detail & Related papers (2024-01-17T04:43:45Z)
- GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- DataComp: In search of the next generation of multimodal datasets [179.79323076587255]
DataComp is a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl.
Our benchmark consists of multiple compute scales spanning four orders of magnitude.
In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet.
arXiv Detail & Related papers (2023-04-27T11:37:18Z)
- Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z)
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [109.0573737034428]
Large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks.
This work proposes a novel training paradigm, Data-efficient CLIP (DeCLIP), to alleviate CLIP's heavy data requirements.
We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently.
arXiv Detail & Related papers (2021-10-11T12:17:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.