Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning
- URL: http://arxiv.org/abs/2105.10195v1
- Date: Fri, 21 May 2021 08:08:28 GMT
- Title: Aligning Visual Prototypes with BERT Embeddings for Few-Shot Learning
- Authors: Kun Yan, Zied Bouraoui, Ping Wang, Shoaib Jameel, Steven Schockaert
- Abstract summary: Few-shot learning is the task of learning to recognize previously unseen categories of images.
We propose a method that takes into account the names of the image classes.
- Score: 48.583388368897126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot learning (FSL) is the task of learning to recognize previously
unseen categories of images from a small number of training examples. This is a
challenging task, as the available examples may not be enough to unambiguously
determine which visual features are most characteristic of the considered
categories. To alleviate this issue, we propose a method that additionally
takes into account the names of the image classes. While the use of class names
has already been explored in previous work, our approach differs in two key
aspects. First, while previous work has aimed to directly predict visual
prototypes from word embeddings, we found that better results can be obtained
by treating visual and text-based prototypes separately. Second, we propose a
simple strategy for learning class name embeddings using the BERT language
model, which we found to substantially outperform the GloVe vectors that were
used in previous work. We furthermore propose a strategy for dealing with the
high dimensionality of these vectors, inspired by models for aligning
cross-lingual word embeddings. We provide experiments on miniImageNet, CUB and
tieredImageNet, showing that our approach consistently improves the
state-of-the-art in metric-based FSL.
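The abstract above describes three ingredients: BERT-based class-name embeddings, visual prototypes kept separate from text-based ones, and an alignment step inspired by cross-lingual word-embedding alignment to cope with the high dimensionality of the BERT vectors. The snippet below is only a rough sketch of how those pieces could fit together, not the authors' implementation: the mean-pooled BERT embedding, the ridge-regression alignment map, the distance fusion weight, and the random stand-in visual features are all illustrative assumptions.

```python
# Rough sketch (not the authors' code): build BERT class-name prototypes,
# visual prototypes from support features, and align the text space to the
# visual space with a ridge-regression linear map, loosely in the spirit of
# cross-lingual embedding alignment. The visual encoder is a placeholder.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def class_name_embeddings(class_names):
    """Mean-pooled BERT token embeddings for each class name
    (one assumed pooling strategy among several possible ones)."""
    batch = tokenizer(class_names, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state          # (C, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (C, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # (C, 768)

def visual_prototypes(support_feats, support_labels, num_classes):
    """Per-class mean of support-image features, as in prototypical networks."""
    protos = torch.zeros(num_classes, support_feats.size(1))
    for c in range(num_classes):
        protos[c] = support_feats[support_labels == c].mean(0)
    return protos

def align_text_to_visual(text_protos, vis_protos, lam=1.0):
    """Ridge-regression map W so that text_protos @ W approximates vis_protos.
    A simple stand-in for the paper's strategy of aligning high-dimensional
    BERT vectors with the lower-dimensional visual prototype space."""
    T = text_protos
    d = T.size(1)
    return torch.linalg.solve(T.t() @ T + lam * torch.eye(d), T.t() @ vis_protos)

def classify(query_feats, vis_protos, text_protos, W, alpha=0.5):
    """Combine distances to visual prototypes and to mapped text prototypes;
    the fusion weight alpha is an illustrative choice."""
    mapped = text_protos @ W
    d_vis = torch.cdist(query_feats, vis_protos)
    d_txt = torch.cdist(query_feats, mapped)
    return (-(alpha * d_vis + (1 - alpha) * d_txt)).argmax(dim=1)

# Example with random features standing in for a CNN backbone (5-way, 5-shot):
C, shots, d_vis = 5, 5, 640
support = torch.randn(C * shots, d_vis)
labels = torch.arange(C).repeat_interleave(shots)
names = ["golden retriever", "school bus", "violin", "coral reef", "teapot"]
vis_p = visual_prototypes(support, labels, C)
txt_p = class_name_embeddings(names)
W = align_text_to_visual(txt_p, vis_p)
pred = classify(torch.randn(10, d_vis), vis_p, txt_p, W)
```

In the paper the prototypes and their combination are learned within a metric-based FSL framework; the closed-form ridge map above is only the simplest stand-in for that alignment idea.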
Related papers
- TAI++: Text as Image for Multi-Label Image Classification by Co-Learning Transferable Prompt [15.259819430801402]
We propose a pseudo-visual prompt (PVP) module for implicit visual prompt tuning.
Specifically, we first learn the pseudo-visual prompt for each category, mining diverse visual knowledge by the well-aligned space of pre-trained vision-language models.
Experimental results on the VOC2007, MS-COCO, and NUSWIDE datasets demonstrate that our method can surpass state-of-the-art (SOTA) methods.
arXiv Detail & Related papers (2024-05-11T06:11:42Z)
- Mixture of Self-Supervised Learning [2.191505742658975]
Self-supervised learning works by training the model on a pretext task before it is applied to a specific downstream task.
Previous studies have only used one type of transformation as a pretext task.
This raises the question of how performance is affected when more than one pretext task is used and a gating network is used to combine them; a minimal sketch of such a gating network appears after this list.
arXiv Detail & Related papers (2023-07-27T14:38:32Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions to a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image; a sketch of this last step appears after this list.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
- LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z)
- Exploiting Category Names for Few-Shot Classification with Vision-Language Models [78.51975804319149]
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks.
This paper shows that we can significantly improve the performance of few-shot classification by using the category names to initialize the classification head; a sketch of this initialization appears after this list.
arXiv Detail & Related papers (2022-11-29T21:08:46Z)
- Semantic Cross Attention for Few-shot Learning [9.529264466445236]
We propose a multi-task learning approach that treats the semantic features of label text as an auxiliary task.
Our proposed model uses word-embedding representations as semantic features to help train the embedding network, and a semantic cross-attention module to bridge the semantic features into the visual modality.
arXiv Detail & Related papers (2022-10-12T15:24:59Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models; a minimal score-map sketch appears after this list.
Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- Webly Supervised Semantic Embeddings for Large Scale Zero-Shot Learning [8.472636806304273]
Zero-shot learning (ZSL) makes object recognition in images possible in the absence of visual training data for some of the classes in a dataset.
We focus on the problem of semantic class prototype design for large scale ZSL.
We investigate the use of noisy textual metadata associated with photos as text collections.
arXiv Detail & Related papers (2020-08-06T21:33:44Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
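The "Mixture of Self-Supervised Learning" entry above combines several pretext tasks through a gating network. The module below is only a generic sketch of that idea under assumed shapes and layer choices (a softmax gate weighting the outputs of per-task heads); it is not the paper's architecture.

```python
import torch
import torch.nn as nn

class PretextGate(nn.Module):
    """Generic gating sketch: softmax weights over K pretext-task heads,
    then a weighted sum of their representations (an assumption about how
    such a mixture could be wired, not the paper's exact design)."""
    def __init__(self, in_dim, feat_dim, num_tasks):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
             for _ in range(num_tasks)]
        )
        self.gate = nn.Linear(in_dim, num_tasks)

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)           # (B, K)
        feats = torch.stack([h(x) for h in self.heads], dim=1)  # (B, K, D)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)       # (B, D)

mix = PretextGate(in_dim=512, feat_dim=128, num_tasks=3)
out = mix(torch.randn(4, 512))  # -> (4, 128)
```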
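The SLR-AVD entry pipelines an LLM (to generate visual descriptions per class), a VLM (to score each image against those descriptions), and sparse logistic regression (to select a relevant subset of descriptors). The snippet below illustrates only the selection step, with random numbers standing in for the VLM similarity scores; the L1 penalty and its strength are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: rows are images, columns are image-to-description
# similarity scores that a VLM such as CLIP would produce.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 60))    # 60 candidate descriptors
y_train = rng.integers(0, 5, size=200)  # 5 classes

# An L1 penalty drives most descriptor weights to zero, i.e. it selects a
# sparse, relevant subset of the descriptions for classification.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_train, y_train)
selected = np.where(np.abs(clf.coef_).max(axis=0) > 1e-6)[0]
print("descriptors kept:", selected)
```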
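The "Exploiting Category Names" entry initializes the classification head from the category names. Below is one plausible way to do that with CLIP text embeddings; the checkpoint, prompt template, and the decision to drop the bias are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["golden retriever", "school bus", "violin"]
prompts = [f"a photo of a {c}" for c in class_names]  # assumed prompt template

with torch.no_grad():
    tokens = processor(text=prompts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**tokens)               # (C, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Linear head over CLIP image features, initialized from the name embeddings;
# it can then be fine-tuned on the few labeled shots.
head = nn.Linear(text_emb.size(1), len(class_names), bias=False)
with torch.no_grad():
    head.weight.copy_(text_emb)
```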
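DenseCLIP converts image-text matching into pixel-text matching. The few lines below compute the resulting pixel-text score maps as cosine similarities between per-pixel features and per-class text embeddings, using random tensors in place of CLIP's dense visual features and prompted text embeddings; this shows the core operation only, not DenseCLIP's full context-aware prompting pipeline.

```python
import torch
import torch.nn.functional as F

B, D, H, W, C = 2, 512, 32, 32, 5      # batch, channels, spatial size, classes
pixel_feats = torch.randn(B, D, H, W)  # dense visual features (placeholder)
text_emb = torch.randn(C, D)           # one embedding per class (placeholder)

pixel_feats = F.normalize(pixel_feats, dim=1)
text_emb = F.normalize(text_emb, dim=1)

# Pixel-text score maps: cosine similarity between every pixel and every class,
# giving a (B, C, H, W) tensor that can guide a dense-prediction head.
score_maps = torch.einsum("bdhw,cd->bchw", pixel_feats, text_emb)
```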
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.