LPN: Language-guided Prototypical Network for few-shot classification
- URL: http://arxiv.org/abs/2307.01515v3
- Date: Sat, 21 Oct 2023 10:17:18 GMT
- Title: LPN: Language-guided Prototypical Network for few-shot classification
- Authors: Kaihui Cheng, Chule Yang, Xiao Liu, Naiyang Guan, Zhiyuan Wang
- Abstract summary: Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
- Score: 16.37959398470535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot classification aims to adapt to new tasks with limited labeled
examples. To make full use of the accessible data, recent methods explore
suitable similarity measures between query and support images and learn better
high-dimensional features with meta-training and pre-training strategies.
However, the potential of multi-modal information, which may bring promising
improvements for few-shot classification, has barely been explored. In this
paper, we propose a Language-guided Prototypical Network (LPN) for few-shot
classification, which leverages the complementarity of vision and language
modalities via two parallel branches to improve the classifier. Concretely, to
introduce language modality with limited samples in the visual task, we
leverage a pre-trained text encoder to extract class-level text features
directly from class names while processing images with a conventional image
encoder. Then, we introduce a language-guided decoder to obtain text features
corresponding to each image by aligning class-level features with visual
features. Additionally, we utilize class-level features and prototypes to build
a refined prototypical head, which generates robust prototypes in the text
branch for follow-up measurement. Furthermore, we leverage the class-level
features to align the visual features, capturing more class-relevant visual
features. Finally, we aggregate the visual and text logits to calibrate the
deviation of a single modality, enhancing the overall performance. Extensive
experiments demonstrate the competitiveness of LPN against state-of-the-art
methods on benchmark datasets.
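As a reading aid, the pipeline described above can be sketched in PyTorch-style pseudocode. This is only a minimal illustration of the abstract, not the authors' implementation: the module names, the cross-attention decoder, the Euclidean distance metric, and the logit-averaging fusion are all assumptions, and the abstract's visual-feature alignment step is omitted.

```python
import torch
import torch.nn as nn

class LPNSketch(nn.Module):
    """Minimal sketch of the two-branch pipeline described in the abstract.
    All names, shapes, and design details here are assumptions."""

    def __init__(self, image_encoder, text_encoder, dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # conventional image encoder
        self.text_encoder = text_encoder     # pre-trained text encoder; maps class names to (n_way, dim)
        # language-guided decoder: cross-attention from each visual feature
        # to the class-level text features
        self.decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # refined prototypical head: fuses text-branch prototypes with class-level features
        self.proto_head = nn.Linear(2 * dim, dim)

    def per_image_text_features(self, visual, class_feats):
        # text features corresponding to each image, obtained by aligning
        # class-level features with visual features
        kv = class_feats.unsqueeze(0).repeat(visual.size(0), 1, 1)   # (N, n_way, dim)
        out, _ = self.decoder(visual.unsqueeze(1), kv, kv)           # (N, 1, dim)
        return out.squeeze(1)

    def forward(self, support_imgs, support_labels, query_imgs, class_names, n_way):
        class_feats = self.text_encoder(class_names)                 # (n_way, dim)
        s_vis = self.image_encoder(support_imgs)                     # (n_way * k_shot, dim)
        q_vis = self.image_encoder(query_imgs)                       # (n_query, dim)

        s_txt = self.per_image_text_features(s_vis, class_feats)
        q_txt = self.per_image_text_features(q_vis, class_feats)

        # visual prototypes: per-class mean of support visual features
        vis_proto = torch.stack([s_vis[support_labels == c].mean(0) for c in range(n_way)])
        # text-branch prototypes, refined with the class-level features
        txt_proto = torch.stack([s_txt[support_labels == c].mean(0) for c in range(n_way)])
        txt_proto = self.proto_head(torch.cat([txt_proto, class_feats], dim=-1))

        # negative squared Euclidean distance as the similarity measure (assumption)
        vis_logits = -torch.cdist(q_vis, vis_proto) ** 2
        txt_logits = -torch.cdist(q_txt, txt_proto) ** 2

        # aggregate the two modalities to calibrate single-modality deviation
        return (vis_logits + txt_logits) / 2
```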
Related papers
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features for classification.
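That recipe can be illustrated with a small, hypothetical sketch. It assumes the LLM-generated class descriptions and the VLM embeddings are already computed; the function name and the L1-regularization strength are placeholders, not the paper's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sparse_descriptor_classifier(image_embs, desc_embs, labels):
    """image_embs: (n_images, d) VLM image embeddings (assumed precomputed).
    desc_embs: (n_descriptions, d) VLM text embeddings of LLM-generated class descriptions.
    Each image is represented by its similarity to every description, and an
    L1-regularized (sparse) logistic regression selects the relevant descriptors."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    features = img @ txt.T                       # (n_images, n_descriptions) cosine scores

    clf = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
    clf.fit(features, labels)
    return clf
```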
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
- Multi-Modal Classifiers for Open-Vocabulary Object Detection [104.77331131447541]
The goal of this paper is open-vocabulary object detection (OVOD).
We adopt a standard two-stage object detector architecture.
We explore three ways of specifying categories: language descriptions, image exemplars, or a combination of the two.
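A hypothetical sketch of how those three options could yield a single classifier embedding (the averaging fusion is an assumption, not necessarily the paper's mechanism):

```python
import torch
import torch.nn.functional as F

def build_class_embedding(text_emb=None, exemplar_embs=None):
    """Builds a classifier embedding for one category from a language description
    (text_emb: (d,)), a set of image exemplars (exemplar_embs: (n, d)), or both.
    Simple averaging fusion; names and details are illustrative assumptions."""
    parts = []
    if text_emb is not None:
        parts.append(F.normalize(text_emb, dim=0))
    if exemplar_embs is not None:
        parts.append(F.normalize(exemplar_embs.mean(dim=0), dim=0))
    assert parts, "provide a description, exemplars, or both"
    return F.normalize(torch.stack(parts).mean(dim=0), dim=0)
```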
arXiv Detail & Related papers (2023-06-08T18:31:56Z)
- OSIC: A New One-Stage Image Captioner Coined [38.46732302316068]
We propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning.
To obtain rich features, we use the Swin Transformer to calculate multi-level features.
To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module.
arXiv Detail & Related papers (2022-11-04T08:50:09Z)
- Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
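As a rough illustration of the hypernetwork idea, the sketch below maps class-description embeddings to the weights of a linear classifier over image features; the architecture and names are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class HypernetClassifier(nn.Module):
    """Hypothetical sketch: a hypernetwork turns each class-description embedding
    into one row (and bias) of a linear classifier over image features."""

    def __init__(self, text_dim=512, feat_dim=512, hidden=1024):
        super().__init__()
        self.weight_gen = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(), nn.Linear(hidden, feat_dim)
        )
        self.bias_gen = nn.Linear(text_dim, 1)

    def forward(self, desc_embs, image_feats):
        # desc_embs: (n_classes, text_dim), image_feats: (n_images, feat_dim)
        W = self.weight_gen(desc_embs)              # (n_classes, feat_dim)
        b = self.bias_gen(desc_embs).squeeze(-1)    # (n_classes,)
        return image_feats @ W.t() + b              # (n_images, n_classes) logits
```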
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
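The pixel-text matching idea can be sketched as computing a per-pixel similarity map against class text embeddings; this is an illustrative sketch with assumed names and shapes, not DenseCLIP's exact code.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(pixel_feats, text_feats, tau=0.07):
    """pixel_feats: (B, C, H, W) dense features from a CLIP-like image encoder.
    text_feats: (K, C) class (prompt) embeddings from the text encoder.
    Returns (B, K, H, W) pixel-text score maps that can guide a dense prediction head."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=1)
    # cosine similarity between every pixel feature and every class embedding
    return torch.einsum("bchw,kc->bkhw", pixel_feats, text_feats) / tau
```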
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
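A minimal sketch of what text-to-pixel contrastive alignment can look like (names, shapes, and the binary cross-entropy formulation are assumptions, not CRIS's exact loss):

```python
import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(pixel_feats, text_feat, mask, tau=0.07):
    """pixel_feats: (C, H, W) per-pixel features; text_feat: (C,) sentence embedding;
    mask: (H, W) binary mask of the referred object. Pixels inside the mask should
    score high against the sentence embedding, pixels outside should score low."""
    pix = F.normalize(pixel_feats.flatten(1), dim=0)   # (C, H*W)
    txt = F.normalize(text_feat, dim=0)                # (C,)
    logits = (txt @ pix) / tau                         # (H*W,)
    target = mask.flatten().float()                    # (H*W,)
    return F.binary_cross_entropy_with_logits(logits, target)
```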
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)