Inferring Latent Class Statistics from Text for Robust Visual Few-Shot Learning
- URL: http://arxiv.org/abs/2311.14544v1
- Date: Fri, 24 Nov 2023 15:23:47 GMT
- Title: Inferring Latent Class Statistics from Text for Robust Visual Few-Shot Learning
- Authors: Yassir Bendou, Vincent Gripon, Bastien Pasdeloup, Giulia Lioi, Lukas Mauch, Fabien Cardinaux and Ghouthi Boukli Hacene
- Abstract summary: We present a novel approach that leverages text-derived statistics to predict the mean and covariance of the visual feature distribution for each class.
Our method shows that text can be used to predict the mean and covariance of the distribution, offering promising improvements in few-shot learning scenarios.
- Score: 4.300029426596762
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the realm of few-shot learning, foundation models like CLIP have proven
effective but exhibit limitations in cross-domain robustness, especially in
few-shot settings. Recent works add text as an extra modality to enhance the
performance of these models. Most of these approaches treat text as an
auxiliary modality without fully exploring its potential to elucidate the
underlying visual feature distribution of each class. In this paper, we present a
novel approach that leverages text-derived statistics to predict the mean and
covariance of the visual feature distribution for each class. This predictive
framework enriches the latent space, yielding more robust and generalizable
few-shot learning models. We demonstrate the efficacy of incorporating both
mean and covariance statistics in improving few-shot classification performance
across various datasets. Our method shows that text can be used to predict the
mean and covariance of the distribution, offering promising improvements in
few-shot learning scenarios.
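To make the idea concrete, here is a minimal sketch of the kind of pipeline the abstract describes: a small network maps each class's text embedding to a predicted mean and (diagonal) covariance of its visual features, and query images are scored with per-class Gaussian log-likelihoods. The module names, embedding width, and the diagonal-covariance simplification are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

DIM = 512  # assumed CLIP-like embedding width

class TextToStats(nn.Module):
    """Predicts a class's visual mean and diagonal covariance from its
    text embedding (illustrative; the paper's architecture may differ)."""
    def __init__(self, dim: int = DIM):
        super().__init__()
        self.mean_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.logvar_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, text_emb: torch.Tensor):
        mean = self.mean_head(text_emb)          # (C, D)
        var = self.logvar_head(text_emb).exp()   # (C, D), strictly positive
        return mean, var

def gaussian_logits(x, means, variances):
    """Per-class Gaussian log-likelihoods (up to a constant) used as logits.
    x: (B, D) visual features; means, variances: (C, D)."""
    diff = x.unsqueeze(1) - means.unsqueeze(0)                  # (B, C, D)
    mahalanobis = (diff ** 2 / variances.unsqueeze(0)).sum(-1)  # (B, C)
    log_det = variances.log().sum(-1)                           # (C,)
    return -0.5 * (mahalanobis + log_det.unsqueeze(0))
```

In a few-shot episode, these text-predicted statistics would plausibly be blended with the empirical mean and covariance of the few support features before scoring queries, which is where a robustness gain in low-shot regimes would come from.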
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- Enhancing Few-Shot Image Classification through Learnable Multi-Scale Embedding and Attention Mechanisms [1.1557852082644071]
In the context of few-shot classification, the goal is to train a classifier using a limited number of samples.
Traditional metric-based methods exhibit certain limitations in achieving this objective.
Our approach involves a multi-output embedding network that maps samples into distinct feature spaces.
arXiv Detail & Related papers (2024-09-12T12:34:29Z)
- BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models [20.88680592729709]
We propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models.
BaFTA directly estimates class centroids using online clustering within a projected embedding space.
We demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
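As a rough illustration of backprop-free centroid estimation, the sketch below runs one online-clustering update over a batch of test embeddings. BaFTA's projected embedding space and aggregation details are not reproduced; initializing centroids from class-name text embeddings and the running-mean update are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def online_centroid_step(centroids, counts, feats):
    """One backprop-free update: assign each normalized test embedding to
    its nearest class centroid, then move that centroid by a running mean.
    Sketch only; not BaFTA's exact procedure."""
    feats = F.normalize(feats, dim=-1)
    sims = feats @ F.normalize(centroids, dim=-1).T
    assign = sims.argmax(dim=-1)             # nearest-centroid assignment
    for i, k in enumerate(assign.tolist()):
        counts[k] += 1
        centroids[k] += (feats[i] - centroids[k]) / counts[k]  # running mean
    return centroids, counts

# Centroids would typically start from the text embeddings of the class names.
```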
arXiv Detail & Related papers (2024-06-17T08:16:24Z)
- The Common Stability Mechanism behind most Self-Supervised Learning Approaches [64.40701218561921]
We provide a framework to explain the stability mechanism of different self-supervised learning techniques.
We discuss the working mechanisms of contrastive techniques like SimCLR and non-contrastive techniques like BYOL, SWAV, SimSiam, Barlow Twins, and DINO.
We formulate different hypotheses and test them using the Imagenet100 dataset.
arXiv Detail & Related papers (2024-02-22T20:36:24Z)
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt Learning with Data-Dependent Prior [14.232144691524528]
Recent vision-language pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian framework for prompt learning that can alleviate overfitting in few-shot learning applications.
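For intuition, a data-dependent prior can be sketched as sampling the prompt's context vectors from a Gaussian whose parameters are conditioned on image features, rather than learning a single point-estimate context as in MLE training. Everything below (names, shapes, the reparameterization choice) is a hypothetical illustration, not this paper's model.

```python
import torch
import torch.nn as nn

class BayesianContext(nn.Module):
    """Samples prompt context vectors from an image-conditioned Gaussian
    via the reparameterization trick. Hypothetical sketch of the idea."""
    def __init__(self, n_ctx: int = 4, dim: int = 512):
        super().__init__()
        self.mu = nn.Linear(dim, n_ctx * dim)
        self.logvar = nn.Linear(dim, n_ctx * dim)
        self.n_ctx, self.dim = n_ctx, dim

    def forward(self, img_feat):                   # (B, D) image features
        mu = self.mu(img_feat)
        std = (0.5 * self.logvar(img_feat)).exp()
        ctx = mu + std * torch.randn_like(std)     # sample, keep gradients
        return ctx.view(-1, self.n_ctx, self.dim)  # (B, n_ctx, D)
```

Averaging predictions over several sampled contexts at test time is one standard way such a model would reduce the overfitting of a single MLE context vector.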
arXiv Detail & Related papers (2024-01-09T10:15:59Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Few-shot Text Classification with Dual Contrastive Consistency [31.141350717029358]
In this paper, we explore how to utilize a pre-trained language model to perform few-shot text classification.
We adopt supervised contrastive learning on the few labeled data and consistency regularization on vast unlabeled data.
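The labeled branch of such a scheme is typically a supervised contrastive loss; a self-contained sketch is below. The temperature value and masking details are generic assumptions, not this paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of features (B, D) with
    integer labels (B,). Generic sketch of the labeled-data branch."""
    features = F.normalize(features, dim=-1)
    sim = features @ features.T / temperature             # (B, B) similarities
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)      # positives share a label
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos = pos & ~self_mask
    # log-softmax over all other samples, averaged over each anchor's positives
    logits = sim.masked_fill(self_mask, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = pos.sum(1).clamp(min=1)
    loss = -(log_prob * pos.float()).sum(1) / pos_count
    return loss.mean()
```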
arXiv Detail & Related papers (2022-09-29T19:26:23Z)
- Revisiting Consistency Regularization for Semi-Supervised Learning [80.28461584135967]
We propose an improved consistency regularization framework built on a simple yet effective technique, FeatDistLoss.
Experimental results show that our model defines a new state of the art for various datasets and settings.
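The abstract does not spell FeatDistLoss out, but a generic feature-distance consistency term between two augmented views of the same unlabeled image looks like the sketch below; treat the distance choice and weighting as assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def consistency_loss(encoder, x_weak, x_strong, lam=1.0):
    """Generic consistency regularization: pull the features of two
    augmentations of the same image together. Sketch only; the actual
    FeatDistLoss may use a different distance and target semantics."""
    f_weak = encoder(x_weak).detach()   # weakly augmented view acts as the target
    f_strong = encoder(x_strong)
    return lam * F.mse_loss(f_strong, f_weak)
```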
arXiv Detail & Related papers (2021-12-10T20:46:13Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
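A bare-bones self-training loop of the kind such methods build on is sketched below: confident predictions on unlabeled data become pseudo-labels for an extra supervised term. The confidence threshold and loss weighting are illustrative assumptions; SFLM's prompt-based specifics are not reproduced.

```python
import torch
import torch.nn.functional as F

def self_training_loss(model, x_labeled, y, x_unlabeled,
                       threshold=0.9, lam=1.0):
    """Supervised loss on labeled data plus a pseudo-label loss on
    confident unlabeled predictions (generic self-training sketch)."""
    sup = F.cross_entropy(model(x_labeled), y)
    with torch.no_grad():
        probs = model(x_unlabeled).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold                  # only confident predictions
    if keep.any():
        unsup = F.cross_entropy(model(x_unlabeled[keep]), pseudo[keep])
    else:
        unsup = torch.zeros((), device=sup.device)
    return sup + lam * unsup
```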
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.