CTP: Towards Vision-Language Continual Pretraining via Compatible
Momentum Contrast and Topology Preservation
- URL: http://arxiv.org/abs/2308.07146v1
- Date: Mon, 14 Aug 2023 13:53:18 GMT
- Title: CTP: Towards Vision-Language Continual Pretraining via Compatible
Momentum Contrast and Topology Preservation
- Authors: Hongguang Zhu, Yunchao Wei, Xiaodan Liang, Chunjie Zhang, Yao Zhao
- Abstract summary: Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D.
The data from each industry, treated as an independent task, supports continual learning and conforms to a real-world long-tail distribution to simulate pretraining on web data.
- Score: 128.00940554196976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Pretraining (VLP) has shown impressive results on diverse
downstream tasks by offline training on large-scale datasets. Regarding the
growing nature of real-world data, such an offline training paradigm on
ever-expanding data is unsustainable, because models lack the continual
learning ability to accumulate knowledge constantly. However, most continual
learning studies are limited to uni-modal classification and existing
multi-modal datasets cannot simulate continual non-stationary data stream
scenarios. To support the study of Vision-Language Continual Pretraining
(VLCP), we first contribute a comprehensive and unified benchmark dataset P9D
which contains over one million product image-text pairs from 9 industries. The
data from each industry, treated as an independent task, supports continual learning and
conforms to a real-world long-tail distribution to simulate pretraining on web
data. We comprehensively study the characteristics and challenges of VLCP, and
propose a new algorithm: Compatible momentum contrast with Topology
Preservation, dubbed CTP. The compatible momentum model absorbs the knowledge
of the current and previous-task models to flexibly update the modal feature.
Moreover, Topology Preservation transfers the knowledge of embedding across
tasks while preserving the flexibility of feature adjustment. The experimental
results demonstrate our method not only achieves superior performance compared
with other baselines but also does not bring an expensive training burden.
Dataset and codes are available at https://github.com/KevinLight831/CTP.
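To make the two components named in the abstract more concrete, here is a minimal PyTorch-style sketch of a compatible momentum update and a topology-preservation loss. The function names, the blending coefficient alpha, the momentum m, and the temperature tau are illustrative assumptions, not the paper's exact formulation; the authors' implementation is in the repository linked above.

import torch
import torch.nn.functional as F

@torch.no_grad()
def compatible_momentum_update(momentum_model, current_model, previous_model,
                               m=0.999, alpha=0.5):
    # Hypothetical compatible EMA update: the momentum encoder tracks a blend of
    # the current-task and previous-task weights so it remains usable for both.
    for p_mom, p_cur, p_prev in zip(momentum_model.parameters(),
                                    current_model.parameters(),
                                    previous_model.parameters()):
        blended = alpha * p_cur.data + (1.0 - alpha) * p_prev.data
        p_mom.data.mul_(m).add_(blended, alpha=1.0 - m)

def topology_preservation_loss(new_emb, old_emb, tau=0.05):
    # Match the pairwise similarity structure ("topology") of the old embedding
    # space rather than the embeddings themselves, leaving features free to adjust.
    new_emb = F.normalize(new_emb, dim=-1)
    old_emb = F.normalize(old_emb, dim=-1)
    sim_new = new_emb @ new_emb.t() / tau
    sim_old = old_emb @ old_emb.t() / tau
    return F.kl_div(F.log_softmax(sim_new, dim=-1),
                    F.softmax(sim_old, dim=-1),
                    reduction="batchmean")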
Related papers
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - Continual Learning for Multimodal Data Fusion of a Soft Gripper [1.0589208420411014]
A model trained on one data modality often fails when tested with a different modality.
We introduce a continual learning algorithm capable of incrementally learning different data modalities.
We evaluate the algorithm's effectiveness on a challenging custom multimodal dataset.
arXiv Detail & Related papers (2024-09-20T09:53:27Z) - Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training [44.790636524264]
Point Prompt Training is a novel framework for multi-dataset synergistic learning in the context of 3D representation learning.
It can overcome the negative transfer associated with synergistic learning and produce generalizable representations.
It achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training.
arXiv Detail & Related papers (2023-08-18T17:59:57Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Continual Vision-Language Representation Learning with Off-Diagonal
Information [112.39419069447902]
Multi-modal contrastive learning frameworks like CLIP typically require a large amount of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
arXiv Detail & Related papers (2023-05-11T08:04:46Z) - On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits of the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z) - Online Continual Learning with Natural Distribution Shifts: An Empirical
Study with Visual Data [101.6195176510611]
"Online" continual learning enables evaluating both information retention and online learning efficacy.
In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online.
We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts.
arXiv Detail & Related papers (2021-08-20T06:17:20Z)