Learning without Forgetting for Vision-Language Models
- URL: http://arxiv.org/abs/2305.19270v1
- Date: Tue, 30 May 2023 17:59:32 GMT
- Title: Learning without Forgetting for Vision-Language Models
- Authors: Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan,
Ziwei Liu
- Abstract summary: Class-Incremental Learning (CIL), or continual learning, is a desired capability in the real world.
Recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations.
We propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting.
- Score: 65.49600786387106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Class-Incremental Learning (CIL), or continual learning, is a desired
capability in the real world: it requires a learning system to adapt to new
tasks without forgetting former ones. While traditional CIL methods focus on
visual information to grasp core features, recent advances in Vision-Language
Models (VLM) have shown promising capabilities in learning generalizable
representations with the aid of textual information. However, when continually
trained with new classes, VLMs often suffer from catastrophic forgetting of
former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to
adapt the model without forgetting; and 2) how to make full use of the
multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that
enables VLMs to learn without forgetting. To handle the first challenge, we
propose training task-specific projections based on the frozen image/text
encoders. When facing new tasks, new projections are expanded and former
projections are fixed, alleviating the forgetting of old concepts. For the
second challenge, we propose a fusion module to better utilize cross-modal
information. By jointly adjusting visual and textual features,
the model can capture semantic information with stronger representation
ability. Extensive experiments on nine benchmark datasets validate that PROOF
achieves state-of-the-art performance.
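As a rough, non-authoritative sketch of the two ideas above (task-specific projections over frozen encoders, plus cross-modal fusion), the PyTorch snippet below shows one way they could fit together. The module names, the attention-based fusion, and the cosine-similarity head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch, assuming a CLIP-like backbone whose image/text encoders are
# frozen and produce features of dimension `feat_dim`. All names below are
# illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpandableProjections(nn.Module):
    """One linear projection per task; earlier projections are frozen."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.feat_dim = feat_dim
        self.projections = nn.ModuleList()

    def add_task(self) -> None:
        # Freeze every previously learned projection ...
        for proj in self.projections:
            for p in proj.parameters():
                p.requires_grad_(False)
        # ... and append a trainable projection for the new task.
        self.projections.append(nn.Linear(self.feat_dim, self.feat_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Aggregate the outputs of all task-specific projections.
        return torch.stack([proj(x) for proj in self.projections]).sum(dim=0)


class CrossModalFusion(nn.Module):
    """Jointly adjusts visual and textual features via self-attention."""

    def __init__(self, feat_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor):
        # visual: [B, Nv, D], textual: [B, Nt, D]
        tokens = torch.cat([visual, textual], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.split([visual.size(1), textual.size(1)], dim=1)


def class_logits(image_feat, class_text_feats, img_proj, txt_proj, fusion, tau=0.07):
    """Cosine-similarity logits between projected-and-fused image/text features.

    image_feat: [B, D] frozen image-encoder features.
    class_text_feats: [C, D] frozen text-encoder features of class prompts.
    """
    v = img_proj(image_feat).unsqueeze(1)                                  # [B, 1, D]
    t = txt_proj(class_text_feats).unsqueeze(0).expand(v.size(0), -1, -1)  # [B, C, D]
    v, t = fusion(v, t)
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    return (v * t).sum(dim=-1) / tau                                       # [B, C]
```

In this reading of the abstract, `add_task` would be called once per incremental task, so only the newest projection (presumably together with the fusion module) receives gradients while earlier projections stay fixed, which is what limits forgetting of old concepts.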
Related papers
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization.
Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets.
arXiv Detail & Related papers (2024-06-18T14:07:13Z) - MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning
for Multimodal Video Captioning [10.95493493610559]
We propose a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC).
To better constrain the knowledge characteristics of old and new tasks at the feature level, we design Two-stage Knowledge Distillation (TsKD).
Our experiments on the public dataset MSR-VTT show that the proposed method significantly resists the forgetting of previous tasks without replaying old samples, and performs well on the new task.
arXiv Detail & Related papers (2024-02-27T16:54:08Z) - VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z) - Exploring Effective Factors for Improving Visual In-Context Learning [56.14208975380607]
In-Context Learning (ICL) learns a new task from a few demonstrations (a.k.a. a prompt) and predicts on new inputs without tuning the model.
This paper shows that prompt selection and prompt fusion are two major factors with a direct impact on the inference performance of visual in-context learning.
We propose a simple framework, prompt-SelF, for visual in-context learning.
arXiv Detail & Related papers (2023-04-10T17:59:04Z) - Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z) - Towards Learning a Generic Agent for Vision-and-Language Navigation via
Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions.
The resulting agent learns new tasks more effectively and generalizes better to previously unseen environments.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.