What does a platypus look like? Generating customized prompts for zero-shot image classification
- URL: http://arxiv.org/abs/2209.03320v3
- Date: Sun, 3 Dec 2023 22:44:05 GMT
- Title: What does a platypus look like? Generating customized prompts for zero-shot image classification
- Authors: Sarah Pratt, Ian Covert, Rosanne Liu, Ali Farhadi
- Abstract summary: This work introduces a simple method to generate higher accuracy prompts without relying on any explicit knowledge of the task domain.
We leverage the knowledge contained in large language models (LLMs) to generate many descriptive sentences that contain important discriminating characteristics of the image categories.
This approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet.
- Score: 52.92839995002636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary models are a promising new paradigm for image classification.
Unlike traditional classification models, open-vocabulary models classify among
any arbitrary set of categories specified with natural language during
inference. This natural language, called "prompts", typically consists of a set
of hand-written templates (e.g., "a photo of a {}") which are completed with
each of the category names. This work introduces a simple method to generate
higher accuracy prompts, without relying on any explicit knowledge of the task
domain and with far fewer hand-constructed sentences. To achieve this, we
combine open-vocabulary models with large language models (LLMs) to create
Customized Prompts via Language models (CuPL, pronounced "couple"). In
particular, we leverage the knowledge contained in LLMs in order to generate
many descriptive sentences that contain important discriminating
characteristics of the image categories. This allows the model to place a
greater importance on these regions in the image when making predictions. We
find that this straightforward and general approach improves accuracy on a
range of zero-shot image classification benchmarks, including over one
percentage point gain on ImageNet. Finally, this simple baseline requires no
additional training and remains completely zero-shot. Code available at
https://github.com/sarahpratt/CuPL.
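As a concrete illustration of the recipe described above, here is a minimal sketch of CuPL-style zero-shot classification. It assumes the open_clip package; the class descriptions below are hypothetical stand-ins for LLM-generated sentences, and the authors' actual prompts and code live at https://github.com/sarahpratt/CuPL.

```python
# Minimal sketch of CuPL-style zero-shot classification (assumes open_clip;
# the descriptions below are hypothetical stand-ins for LLM output).
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Stand-ins for LLM-generated descriptions, e.g. from prompting an LLM
# with "Describe what a platypus looks like."
cupl_prompts = {
    "platypus": [
        "A platypus is a semi-aquatic mammal with a duck-like bill.",
        "A photo of a platypus, with webbed feet and a flat beaver-like tail.",
    ],
    "beaver": [
        "A beaver is a large rodent with a broad, scaly, paddle-shaped tail.",
        "A photo of a beaver with thick brown fur, gnawing on wood.",
    ],
}

# Build one classifier vector per class by embedding its descriptive
# sentences with the CLIP text encoder and averaging.
with torch.no_grad():
    class_weights = []
    for sentences in cupl_prompts.values():
        emb = model.encode_text(tokenizer(sentences))
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        class_weights.append(mean / mean.norm())
    classifier = torch.stack(class_weights)  # (num_classes, embed_dim)

def classify(image):
    """Predict a class index for one preprocessed image tensor."""
    with torch.no_grad():
        feat = model.encode_image(image.unsqueeze(0))
        feat = feat / feat.norm(dim=-1, keepdim=True)
        return (feat @ classifier.T).argmax(dim=-1).item()
```

The only change relative to standard CLIP zero-shot classification is that many LLM-written sentences per class replace the hand-written "a photo of a {}" templates; no additional training is involved.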
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- ITI-GEN: Inclusive Text-to-Image Generation [56.72212367905351]
This study investigates inclusive text-to-image generative models that generate images based on human-written prompts.
We show that, for some attributes, images can represent concepts more expressively than text.
We propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration.
arXiv Detail & Related papers (2023-09-11T15:54:30Z)
- GIST: Generating Image-Specific Text for Fine-grained Object Classification [8.118079247462425]
GIST is a method for generating image-specific fine-grained text descriptions from image-only datasets.
Our method achieves an average improvement of 4.1% in accuracy over CLIP linear probes.
arXiv Detail & Related papers (2023-07-21T02:47:18Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image (see the sketch after this list).
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
- Freestyle Layout-to-Image Synthesis [42.64485133926378]
In this work, we explore the freestyle capability of the model, i.e., how far it can generate unseen semantics onto a given layout.
To this end, we leverage large-scale pre-trained text-to-image diffusion models to generate unseen semantics.
The proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs.
arXiv Detail & Related papers (2023-03-25T09:37:41Z)
- Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
- One-bit Supervision for Image Classification [121.87598671087494]
One-bit supervision is a novel setting of learning from incomplete annotations.
We propose a multi-stage training paradigm which incorporates negative label suppression into an off-the-shelf semi-supervised learning algorithm.
arXiv Detail & Related papers (2020-09-14T03:06:23Z)
- Revisiting Pose-Normalization for Fine-Grained Few-Shot Recognition [46.15360203412185]
Few-shot, fine-grained classification requires a model to learn subtle, fine-grained distinctions between different classes.
A solution is to use pose-normalized representations.
We show that they are extremely effective for few-shot fine-grained classification.
arXiv Detail & Related papers (2020-04-01T21:00:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
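The SLR-AVD entry in the list above outlines a three-step pipeline that is easy to mock up. Below is a rough, self-contained sketch under stated assumptions: random placeholders stand in for the VLM embeddings, all names are illustrative rather than the authors' code, and only the L1-penalized (sparse) logistic regression step uses scikit-learn's real API.

```python
# Rough sketch of the SLR-AVD recipe: images are scored against LLM-written
# class descriptions in a shared VLM embedding space, and an L1-penalized
# logistic regression selects a sparse subset of those description features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder VLM embeddings (in practice: CLIP image/text features).
image_feats = rng.normal(size=(200, 512))
desc_feats = rng.normal(size=(50, 512))   # 50 LLM-written descriptions
image_feats /= np.linalg.norm(image_feats, axis=1, keepdims=True)
desc_feats /= np.linalg.norm(desc_feats, axis=1, keepdims=True)
labels = rng.integers(0, 5, size=200)     # 5 classes, toy labels

# Step 2: represent each image by its similarity to every description.
X = image_feats @ desc_feats.T            # (n_images, n_descriptions)

# Step 3: the L1 penalty drives most coefficients to zero, i.e. it selects
# a relevant subset of the description features for classification.
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(X, labels)
selected = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
print(f"kept {selected.size} of {desc_feats.shape[0]} descriptions")
```

With real embeddings, the surviving descriptions are the human-readable features the classifier actually relies on, which is what makes the approach interpretable as well as accurate.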