Improved Few-Shot Image Classification Through Multiple-Choice Questions
- URL: http://arxiv.org/abs/2407.16145v1
- Date: Tue, 23 Jul 2024 03:09:42 GMT
- Title: Improved Few-Shot Image Classification Through Multiple-Choice Questions
- Authors: Dipika Khullar, Emmett Goodman, Negin Sokhandan,
- Abstract summary: We propose a simple method to boost VQA performance for image classification using only a handful of labeled examples and a multiple-choice question.
We demonstrate this method outperforms both pure visual encoders and zero-shot VQA baselines to achieve impressive performance on common few-shot tasks.
- Score: 1.4432605069307167
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Through a simple multiple-choice language prompt, a VQA model can operate as a zero-shot image classifier, producing a classification label. Compared to typical image encoders, VQA models offer an advantage: VQA-produced image embeddings can be infused with the most relevant visual information through tailored language prompts. Nevertheless, for most tasks, zero-shot VQA performance is lacking, either because of unfamiliar category names or because the pre-training and test data distributions differ. We propose a simple method to boost VQA performance for image classification using only a handful of labeled examples and a multiple-choice question. This few-shot method is training-free and maintains the dynamic and flexible advantages of the VQA model. Rather than relying on the final language output, our approach uses multiple-choice questions to extract prompt-specific latent representations, which are enriched with relevant visual information. These representations are combined to create a final overall image embedding, which is decoded via reference to latent class prototypes constructed from the few labeled examples. We demonstrate that this method outperforms both pure visual encoders and zero-shot VQA baselines, achieving impressive performance on common few-shot tasks including MiniImageNet, Caltech-UCSD Birds, and CIFAR-100. Finally, we show our approach does particularly well in settings with numerous diverse visual attributes, such as the fabric, article-style, texture, and view of different articles of clothing, where other few-shot approaches struggle, since we can tailor our image representations to only the semantic features of interest.
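The decoding step the abstract describes is, in essence, nearest-prototype classification: prompt-specific latent vectors from the VQA model are pooled into one image embedding, which is matched against class prototypes averaged from the few labeled examples. Below is a minimal sketch of that step, assuming the prompt-specific vectors have already been extracted from the VQA model; the helper names, mean pooling, and cosine matching are illustrative choices, not the authors' published code.

```python
import numpy as np

def combine_prompt_embeddings(prompt_embeddings):
    """Pool the latent vectors obtained under each multiple-choice prompt into one
    overall image embedding (simple mean pooling, then unit-normalisation)."""
    stacked = np.stack(list(prompt_embeddings.values()))   # (num_prompts, dim)
    pooled = stacked.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def build_prototypes(support_embeddings, support_labels):
    """Class prototypes = mean embedding of the few labeled support images per class."""
    prototypes = {}
    for label in set(support_labels):
        members = np.stack([e for e, y in zip(support_embeddings, support_labels) if y == label])
        proto = members.mean(axis=0)
        prototypes[label] = proto / np.linalg.norm(proto)
    return prototypes

def classify(query_embedding, prototypes):
    """Predict the label of the nearest prototype by cosine similarity."""
    return max(prototypes, key=lambda label: float(query_embedding @ prototypes[label]))
```

Mean pooling and cosine similarity are stand-ins here; the paper's actual combination of prompt representations may differ.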
Related papers
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - Self-Supervised Open-Ended Classification with Small Visual Language
Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z) - Text Descriptions are Compressive and Invariant Representations for
Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
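The SLR-AVD pipeline in the entry above lends itself to a short sketch: LLM-generated class descriptions are embedded with a VLM text encoder, each image is represented by its similarities to those descriptions, and an L1-penalised logistic regression picks out a sparse, relevant subset of descriptors. The sketch below assumes precomputed CLIP-style image and description embeddings (the LLM prompting step is omitted) and uses scikit-learn; it illustrates the idea rather than reproducing the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def description_features(image_embs, description_embs):
    """Represent every image by its cosine similarity to every class description.
    image_embs:       (N, d) image embeddings from a VLM image encoder.
    description_embs: (D, d) text embeddings of LLM-generated class descriptions."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = description_embs / np.linalg.norm(description_embs, axis=1, keepdims=True)
    return img @ txt.T                                      # (N, D) descriptor features

def fit_sparse_classifier(features, labels, C=1.0):
    """L1 penalty drives most descriptor weights to zero, selecting a relevant subset."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(features, labels)
    return clf
```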
arXiv Detail & Related papers (2023-07-10T03:06:45Z) - LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z) - Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways to specify the novel categories: via language descriptions, image exemplars, or a combination of the two.
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
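The "three ways" above amount to building a per-class classifier vector from a text-description embedding, image-exemplar embeddings, or both, and scoring detector region features against it. A rough illustration follows, under the assumption of naive averaging fusion and cosine scoring; the paper's actual fusion may be learned, and the function names are hypothetical.

```python
import numpy as np

def build_classifier(text_emb=None, exemplar_embs=None):
    """Per-class classifier vector from a language-description embedding,
    image exemplar embeddings, or both (naive averaging fusion)."""
    parts = []
    if text_emb is not None:
        parts.append(text_emb / np.linalg.norm(text_emb))
    if exemplar_embs is not None:
        mean_exemplar = np.mean(exemplar_embs, axis=0)
        parts.append(mean_exemplar / np.linalg.norm(mean_exemplar))
    assert parts, "provide at least one modality"
    weight = np.mean(parts, axis=0)
    return weight / np.linalg.norm(weight)

def score_regions(region_features, class_weights):
    """Cosine scores of each detected region against each open-vocabulary class.
    region_features: (R, d) features from the detector's second stage.
    class_weights:   (C, d) classifier vectors from build_classifier."""
    regions = region_features / np.linalg.norm(region_features, axis=1, keepdims=True)
    return regions @ class_weights.T
```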
arXiv Detail & Related papers (2023-06-08T18:31:56Z) - Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA
Tasks? A: Self-Train on Unlabeled Images! [103.09776737512077]
SelTDA (Self-Taught Data Augmentation) is a strategy for finetuning large vision language models on small-scale VQA datasets.
It generates question-answer pseudolabels directly conditioned on an image, allowing us to pseudolabel unlabeled images.
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions.
arXiv Detail & Related papers (2023-06-06T18:00:47Z) - Blind Image Quality Assessment via Vision-Language Correspondence: A
Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information.
We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z) - Disambiguation of One-Shot Visual Classification Tasks: A Simplex-Based
Approach [8.436437583394998]
We present a strategy that aims to detect the presence of multiple objects in a given shot.
This strategy is based on identifying the corners of a simplex in a high dimensional space.
We show the ability of the proposed method to slightly, yet statistically significantly, improve accuracy in extreme settings.
arXiv Detail & Related papers (2023-01-16T11:37:05Z) - Training and challenging models for text-guided fashion image retrieval [1.4266272677701561]
We introduce a new evaluation dataset, Challenging Fashion Queries (CFQ).
CFQ complements existing benchmarks by including relative captions with positive and negative labels of caption accuracy and conditional image similarity.
We demonstrate the importance of multimodal pretraining for the task and show that domain-specific weak supervision based on attribute labels can augment generic large-scale pretraining.
arXiv Detail & Related papers (2022-04-23T06:24:23Z) - Self-Supervised VQA: Answering Visual Questions using Images and
Captions [38.05223339919346]
VQA models assume the availability of datasets with human-annotated Image-Question-Answer (I-Q-A) triplets for training.
We study whether models can be trained without any human-annotated Q-A pairs, but only with images and associated text captions.
arXiv Detail & Related papers (2020-12-04T01:22:05Z)