FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models?
- URL: http://arxiv.org/abs/2307.04114v1
- Date: Sun, 9 Jul 2023 08:07:43 GMT
- Title: FILM: How can Few-Shot Image Classification Benefit from Pre-Trained Language Models?
- Authors: Zihao Jiang, Yunkai Dang, Dong Pang, Huishuai Zhang, Weiran Huang
- Abstract summary: Few-shot learning aims to train models that can generalize to novel classes with only a few samples.
We propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning.
- Score: 14.582209994281374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot learning aims to train models that can generalize to novel classes with only a few samples. Recently, a line of works has been proposed to enhance few-shot learning with accessible semantic information from class names. However, these works focus on improving existing modules of the standard few-shot learning framework, such as visual prototypes and feature extractors, which limits the full exploitation of semantic information. In this paper, we propose a novel few-shot learning framework that uses pre-trained language models based on contrastive learning. To address the challenge of aligning visual features with the textual embeddings obtained from the text-based pre-trained language model, we carefully design the textual branch of our framework and introduce a metric module that generalizes the cosine similarity. For better transferability, we let the metric module adapt to different few-shot tasks and adopt MAML to train the model via bi-level optimization. Moreover, we conduct extensive experiments on multiple benchmarks to demonstrate the effectiveness of our method.
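As a rough sketch of the "metric module that generalizes the cosine similarity" (hypothetical code assuming PyTorch, not the authors' implementation), the metric can be a cosine similarity computed under a learnable per-dimension weighting, which reduces to the standard cosine when all weights equal one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralizedCosineMetric(nn.Module):
    """Hypothetical sketch of the metric module: cosine similarity under a
    learnable per-dimension weighting w; w = 1 recovers the standard cosine."""

    def __init__(self, dim: int, temperature: float = 10.0):
        super().__init__()
        self.w = nn.Parameter(torch.ones(dim))  # adapted per task (e.g., in a MAML inner loop)
        self.temperature = temperature

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        # visual: (n_query, dim) image features; textual: (n_class, dim) class-name embeddings
        v = F.normalize(visual * self.w, dim=-1)
        t = F.normalize(textual * self.w, dim=-1)
        return self.temperature * v @ t.T  # (n_query, n_class) classification logits
```

Under the MAML-style bi-level optimization the abstract describes, an inner loop would adapt such task-specific parameters on each task's support set, while the outer loop updates the shared parameters across tasks.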
Related papers
- Less is More: A Closer Look at Semantic-based Few-Shot Learning [11.724194320966959]
Few-shot Learning aims to learn and distinguish new categories with a very limited number of available images.
We propose a simple but effective framework for few-shot learning tasks, specifically designed to exploit textual information and the language model.
Our experiments conducted across four widely used few-shot datasets demonstrate that our simple framework achieves impressive results.
arXiv Detail & Related papers (2024-01-10T08:56:02Z) - Contrastive Alignment of Vision to Language Through Parameter-Efficient
Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates (less than 7%) can achieve the same performance as full-model training.
In a series of experiments, we show that existing knowledge is conserved more strongly under parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Contrastive Learning for Prompt-Based Few-Shot Language Learners [14.244787327283335]
- Contrastive Learning for Prompt-Based Few-Shot Language Learners [14.244787327283335]
We present a contrastive learning framework that clusters inputs from the same class under different augmented "views".
We create different "views" of an example by appending it with different language prompts and contextual demonstrations.
Our method can improve over the state-of-the-art methods in a diverse set of 15 language tasks.
arXiv Detail & Related papers (2022-05-03T04:56:45Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods
- Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing [78.8500633981247]
This paper surveys and organizes research works in a new paradigm in natural language processing, which we dub "prompt-based learning".
Unlike traditional supervised learning, which trains a model to take in an input x and predict an output y as P(y|x), prompt-based learning is based on language models that model the probability of text directly.
arXiv Detail & Related papers (2021-07-28T18:09:46Z) - Multimodal Few-Shot Learning with Frozen Language Models [36.75551859968596]
- Multimodal Few-Shot Learning with Frozen Language Models [36.75551859968596]
We train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption.
The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples.
arXiv Detail & Related papers (2021-06-25T21:07:09Z) - InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)