A streamlined Approach to Multimodal Few-Shot Class Incremental Learning
for Fine-Grained Datasets
- URL: http://arxiv.org/abs/2403.06295v1
- Date: Sun, 10 Mar 2024 19:50:03 GMT
- Title: A streamlined Approach to Multimodal Few-Shot Class Incremental Learning
for Fine-Grained Datasets
- Authors: Thang Doan, Sima Behpour, Xin Li, Wenbin He, Liang Gou, Liu Ren
- Abstract summary: Few-shot Class-Incremental Learning (FSCIL) poses the challenge of retaining prior knowledge while learning from limited new data streams.
We propose two lightweight modules. The first, Session-Specific Prompts (SSP), enhances the separability of image-text embeddings across sessions.
The second, a Hyperbolic distance, compresses representations of image-text pairs within the same class while expanding those from different classes, leading to better representations.
- Score: 23.005760505169803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Few-shot Class-Incremental Learning (FSCIL) poses the challenge of retaining
prior knowledge while learning from limited new data streams, all without
overfitting. The rise of Vision-Language models (VLMs) has unlocked numerous
applications, leveraging their existing knowledge to fine-tune on custom data.
However, training the whole model is computationally prohibitive, and VLMs,
while versatile in general domains, still struggle with the fine-grained
datasets crucial to many applications. We tackle these challenges with two
simple proposed modules. The first, Session-Specific Prompts (SSP), enhances
the separability of image-text embeddings across sessions. The second,
Hyperbolic distance, compresses representations of image-text pairs within the
same class while expanding those from different classes, leading to better
representations. Experimental results demonstrate an average 10-point increase
compared to baselines while requiring at least 8 times fewer trainable
parameters. This improvement is further underscored on our three newly
introduced fine-grained datasets.
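To illustrate the hyperbolic-distance idea described above (pulling same-class image-text pairs together while pushing different classes apart), here is a minimal sketch of the standard Poincaré-ball distance often used for such objectives. The paper's exact formulation, curvature, and hyperparameters are not given here, so this is an assumption-laden illustration, not the authors' implementation.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance between two points inside the unit Poincare ball (norm < 1).

    In a hyperbolic objective, same-class image-text pairs would be trained
    toward small distances and different-class pairs toward large ones,
    which is the effect the abstract attributes to its distance module.
    """
    sq_diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_diff / (denom + eps))

# Distances grow rapidly near the ball's boundary, so embeddings of
# different classes separate more sharply than under Euclidean distance.
origin = np.zeros(2)
near = np.array([0.5, 0.0])
close_to_edge = np.array([0.95, 0.0])
print(poincare_distance(origin, near))           # about 1.10 (= 2*artanh(0.50))
print(poincare_distance(origin, close_to_edge))  # about 3.66 (= 2*artanh(0.95))
```

Note the design choice: because the metric blows up toward the boundary, pushing different-class pairs apart costs little embedding "volume", which is one reason hyperbolic geometry suits fine-grained, hierarchy-like label spaces.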
Related papers
- Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization.
Candle achieves state-of-the-art performance in extensive experiments on 11 diverse datasets.
arXiv Detail & Related papers (2024-06-18T14:07:13Z)
- Conditional Prototype Rectification Prompt Learning
We propose a Conditional Prototype Rectification Prompt Learning (CPR) method to correct the bias of base examples and augment limited data in an effective way.
CPR achieves state-of-the-art performance on both few-shot classification and base-to-new generalization tasks.
arXiv Detail & Related papers (2024-04-15T15:43:52Z)
- Convolutional Prompting meets Language Models for Continual Learning
Continual Learning (CL) enables machine learning models to learn from continuously shifting new training data in the absence of data from old tasks.
We propose ConvPrompt, a novel convolutional prompt creation mechanism that maintains layer-wise shared embeddings.
The intelligent use of convolution enables us to maintain a low parameter overhead without compromising performance.
arXiv Detail & Related papers (2024-03-29T17:40:37Z)
- PL-FSCIL: Harnessing the Power of Prompts for Few-Shot Class-Incremental Learning [9.247718160705512]
Few-Shot Class-Incremental Learning (FSCIL) aims to enable deep neural networks to learn new tasks incrementally from a small number of labeled samples.
We propose a novel approach called Prompt Learning for FSCIL (PL-FSCIL).
PL-FSCIL harnesses the power of prompts in conjunction with a pre-trained Vision Transformer (ViT) model to address the challenges of FSCIL effectively.
arXiv Detail & Related papers (2024-01-26T12:11:04Z)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
- Learning without Forgetting for Vision-Language Models [65.49600786387106]
Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
arXiv Detail & Related papers (2023-05-30T17:59:32Z)
- Learning Multimodal Data Augmentation in Feature Space [65.54623807628536]
LeMDA is an easy-to-use method that automatically learns to jointly augment multimodal data in feature space.
We show that LeMDA can profoundly improve the performance of multimodal deep learning architectures.
arXiv Detail & Related papers (2022-12-29T20:39:36Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task because the multi-modal inputs lie in two different feature spaces.
We propose a Multi-Granularity Alignment architecture for the Visual Question Answering task (MGA-VQA).
Our model splits alignment into different levels to learn better correlations without needing additional data or annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
- Text-Based Person Search with Limited Data [66.26504077270356]
Text-based person search (TBPS) aims at retrieving a target person from an image gallery with a descriptive text query.
We present a framework with two novel components to handle the problems brought by limited data.
arXiv Detail & Related papers (2021-10-20T22:20:47Z)
- Complementing Representation Deficiency in Few-shot Image Classification: A Meta-Learning Approach [27.350615059290348]
We propose a meta-learning approach with complemented representations network (MCRNet) for few-shot image classification.
In particular, we embed a latent space, where latent codes are reconstructed with extra representation information to complement the representation deficiency.
Our end-to-end framework achieves the state-of-the-art performance in image classification on three standard few-shot learning datasets.
arXiv Detail & Related papers (2020-07-21T13:25:54Z)