I2MVFormer: Large Language Model Generated Multi-View Document
Supervision for Zero-Shot Image Classification
- URL: http://arxiv.org/abs/2212.02291v1
- Date: Mon, 5 Dec 2022 14:11:36 GMT
- Title: I2MVFormer: Large Language Model Generated Multi-View Document
Supervision for Zero-Shot Image Classification
- Authors: Muhammad Ferjad Naeem, Muhammad Gul Zain Ali Khan, Yongqin Xian,
Muhammad Zeshan Afzal, Didier Stricker, Luc Van Gool, Federico Tombari
- Abstract summary: Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks.
Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views.
I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings.
- Score: 108.83932812826521
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have shown that unstructured text (documents) from online
sources can serve as useful auxiliary information for zero-shot image
classification. However, these methods require access to a high-quality source
like Wikipedia and are limited to a single source of information. Large
Language Models (LLM) trained on web-scale text show impressive abilities to
repurpose their learned knowledge for a multitude of tasks. In this work, we
provide a novel perspective on using an LLM to provide text supervision for a
zero-shot image classification model. The LLM is provided with a few text
descriptions from different annotators as examples. The LLM is conditioned on
these examples to generate multiple text descriptions for each class (referred
to as views). Our proposed model, I2MVFormer, learns multi-view semantic
embeddings for zero-shot image classification with these class views. We show
that each text view of a class provides complementary information allowing a
model to learn a highly discriminative class embedding. Moreover, we show that
I2MVFormer is better at consuming the multi-view text supervision from LLM
compared to baseline models. I2MVFormer establishes a new state-of-the-art on
three public benchmark datasets for zero-shot image classification with
unsupervised semantic embeddings.
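The pipeline the abstract describes can be illustrated with a small sketch. The snippet below is an assumption-laden approximation, not the I2MVFormer implementation: `llm_generate` and `encode_text` are hypothetical stand-ins for an LLM API and a text encoder, and simple mean pooling stands in for the learned multi-view aggregation that I2MVFormer performs.
```python
# Illustrative sketch only: few-shot prompt an LLM for several class "views",
# then pool their text embeddings into one semantic class embedding.
# `llm_generate` and `encode_text` are hypothetical callables, not real APIs.
from typing import Callable, Dict, List
import numpy as np

def build_view_prompt(class_name: str, example_descriptions: List[str]) -> str:
    """Condition the LLM on a few annotator-written descriptions, then ask for a new one."""
    examples = "\n\n".join(example_descriptions)
    return (f"{examples}\n\n"
            f"Write a detailed visual description of the class '{class_name}':")

def multi_view_class_embeddings(
    class_names: List[str],
    example_descriptions: List[str],
    llm_generate: Callable[[str], str],        # hypothetical LLM call
    encode_text: Callable[[str], np.ndarray],  # hypothetical text encoder
    num_views: int = 3,
) -> Dict[str, np.ndarray]:
    """Generate `num_views` LLM views per class and pool their embeddings."""
    class_embs = {}
    for name in class_names:
        prompt = build_view_prompt(name, example_descriptions)
        views = [llm_generate(prompt) for _ in range(num_views)]
        view_embs = np.stack([encode_text(v) for v in views])
        pooled = view_embs.mean(axis=0)  # mean pooling replaces the learned aggregation
        class_embs[name] = pooled / np.linalg.norm(pooled)
    return class_embs

# At test time, an image embedding is scored against these class embeddings
# to pick the most compatible unseen class.
```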
Related papers
- Large Language Models are Good Prompt Learners for Low-Shot Image Classification [12.053713356249695]
We propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder.
Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification.
arXiv Detail & Related papers (2023-12-07T06:43:34Z)
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
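As a point of reference for the first sentence of this entry, here is a minimal sketch of VLM-style zero-shot scoring, under the assumption that a video feature is obtained by mean-pooling per-frame embeddings; the embeddings are random stand-ins rather than outputs of any particular model.
```python
# Sketch: score a query video against text-based class representations by
# cosine similarity (frame embeddings are mean-pooled into one video feature).
import numpy as np

def zero_shot_video_scores(frame_embs: np.ndarray, class_text_embs: np.ndarray) -> np.ndarray:
    """Similarity between a query video and each text-based class representation."""
    video = frame_embs.mean(axis=0)                 # pool frames -> one video feature
    video = video / np.linalg.norm(video)
    text = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return text @ video                             # one score per class

rng = np.random.default_rng(0)
scores = zero_shot_video_scores(rng.normal(size=(16, 512)),   # 16 stand-in frame embeddings
                                rng.normal(size=(10, 512)))   # 10 stand-in class label texts
print(int(scores.argmax()))                                   # predicted class index
```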
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions to a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
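The three-step recipe in this summary lends itself to a brief sketch. The following uses random stand-in embeddings and an L1-penalized scikit-learn logistic regression to mimic the select-a-sparse-subset step; it is an assumption-based illustration, not the paper's SLR-AVD code.
```python
# Sketch: represent each image by its similarity to LLM-generated descriptions,
# then let L1-regularized logistic regression keep only the relevant features.
import numpy as np
from sklearn.linear_model import LogisticRegression

def description_features(image_embs: np.ndarray, desc_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity of each image to each description (n_images x n_descriptions)."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    return img @ txt.T

rng = np.random.default_rng(0)
image_embs = rng.normal(size=(40, 512))   # stand-in VLM image embeddings (few-shot set)
desc_embs = rng.normal(size=(60, 512))    # stand-in embeddings of LLM-generated descriptions
y = rng.integers(0, 4, size=40)           # stand-in few-shot labels for 4 classes

X = description_features(image_embs, desc_embs)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)

# Nonzero coefficients mark the description features the classifier relies on.
selected = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
print(f"kept {selected.size} of {desc_embs.shape[0]} description features")
```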
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
- FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models? [14.582209994281374]
Few-shot learning aims to train models that can be generalized to novel classes with only a few samples.
We propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning.
arXiv Detail & Related papers (2023-07-09T08:07:43Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
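A hypernetwork of the kind this entry describes can be sketched in a few lines. The module below is a hedged illustration with assumed dimensions and a plain MLP, not the Text2Model architecture: it maps each class description's text embedding to the weights of a linear classifier applied to image features.
```python
# Sketch: a hypernetwork turns class description embeddings into the weights
# and biases of a multi-class linear classifier (dimensions are assumptions).
import torch
import torch.nn as nn

class HyperClassifier(nn.Module):
    def __init__(self, text_dim: int = 512, feat_dim: int = 512, hidden: int = 1024):
        super().__init__()
        # Maps one description embedding -> one row of classifier weights (+ bias).
        self.hyper = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim + 1)
        )

    def forward(self, desc_embs: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        params = self.hyper(desc_embs)         # (num_classes, feat_dim + 1)
        W, b = params[:, :-1], params[:, -1]   # generated weights and biases
        return image_feats @ W.T + b           # (batch, num_classes) logits

# Usage with stand-in tensors: 5 classes described by text, 8 query images.
model = HyperClassifier()
logits = model(torch.randn(5, 512), torch.randn(8, 512))
preds = logits.argmax(dim=1)
```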
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
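The word-to-region grounding mentioned above rests on attention between document tokens and image patches. The snippet below is a rough, assumption-based sketch of that mechanism using a stock PyTorch attention layer, not the actual I2DFormer module.
```python
# Sketch: document word embeddings attend over image patch embeddings; the
# attention weights indicate which image regions ground each document word.
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

word_embs = torch.randn(1, 40, dim)    # stand-in: 40 document word embeddings
patch_embs = torch.randn(1, 196, dim)  # stand-in: 14x14 image patch embeddings

# Words query the image patches; `weights` has shape (1, 40, 196).
attended, weights = attn(query=word_embs, key=patch_embs, value=patch_embs)

# One illustrative pooled image-document compatibility score, e.g. for ranking
# class documents against an image.
score = (attended.mean(dim=1) * patch_embs.mean(dim=1)).sum(dim=-1)
```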
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.