Continual Vision-Language Representation Learning with Off-Diagonal Information
- URL: http://arxiv.org/abs/2305.07437v5
- Date: Thu, 1 Jun 2023 16:22:00 GMT
- Title: Continual Vision-Language Representation Learning with Off-Diagonal Information
- Authors: Zixuan Ni and Longhui Wei and Siliang Tang and Yueting Zhuang and Qi Tian
- Abstract summary: Multi-modal contrastive learning frameworks like CLIP typically require a large number of image-text samples for training.
This paper discusses the feasibility of continual CLIP training using streaming data.
- Score: 112.39419069447902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale multi-modal contrastive learning frameworks like CLIP typically
require a large number of image-text samples for training. In real scenarios,
however, these samples are collected continuously over time. This paper
discusses the feasibility of continual CLIP training using streaming data.
Unlike continual learning based on self-supervised methods for pure images,
which is empirically robust against catastrophic forgetting, CLIP suffers a
significant and non-negligible performance degradation in the continual
setting. By analyzing the changes in the model's representation space during
continual CLIP training from a spatial geometry perspective, we explore and
summarize these spatial variations as Spatial Disorder (SD), which can be
divided into Intra-modal Rotation and Inter-modal Deviation. Moreover, we
empirically and theoretically demonstrate how SD leads to a performance decline
for CLIP on cross-modal retrieval tasks. To alleviate SD, we propose a new
continual vision-language representation learning framework, Mod-X: Maintain
off-diagonal information-matriX. By selectively aligning the off-diagonal
information distribution of contrastive matrices, Mod-X improves the
capability of the multi-modal model by maintaining the alignment of the
multi-modal representation space on the old data domain while continually
fitting the new training data domain. Experiments on commonly used datasets
of different scales and scopes demonstrate the effectiveness of our method.
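The mechanism named in the abstract is an alignment of the off-diagonal entries of the image-text contrastive (similarity) matrix between the current model and a model state trained on earlier data. The abstract does not spell out the exact loss, so the sketch below is only an illustrative PyTorch interpretation: the function and method names (contrastive_matrix, off_diagonal_alignment_loss, encode_image, encode_text), the temperature value, the KL-divergence form, and the omission of Mod-X's selective alignment step are all assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_matrix(image_feats, text_feats, temperature=0.07):
    """CLIP-style image-to-text similarity (contrastive) matrix: entry (i, j)
    compares image i with text j; the diagonal holds the matched pairs,
    the off-diagonal entries hold the cross-sample relations."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.t() / temperature

def off_diagonal_alignment_loss(logits_old, logits_new):
    """Keep the current model's off-diagonal distribution close to that of a
    frozen snapshot trained on the old data domain (diagonal masked out)."""
    n = logits_new.size(0)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=logits_new.device)
    p_old = F.softmax(logits_old[off_diag].view(n, n - 1), dim=-1)
    log_p_new = F.log_softmax(logits_new[off_diag].view(n, n - 1), dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean")

def training_step(model, old_model, images, texts):
    """One continual-training step on a batch from the new data domain
    (hypothetical model objects with encode_image / encode_text methods)."""
    with torch.no_grad():  # old_model is a frozen copy from earlier training
        logits_old = contrastive_matrix(old_model.encode_image(images),
                                        old_model.encode_text(texts))
    logits_new = contrastive_matrix(model.encode_image(images),
                                    model.encode_text(texts))
    labels = torch.arange(logits_new.size(0), device=logits_new.device)
    clip_loss = (F.cross_entropy(logits_new, labels) +
                 F.cross_entropy(logits_new.t(), labels)) / 2
    return clip_loss + off_diagonal_alignment_loss(logits_old, logits_new)
```

Note that Mod-X aligns the off-diagonal distribution selectively rather than unconditionally, so the plain KL term above should be read as a simplified stand-in for the paper's actual objective.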
Related papers
- Learning Equi-angular Representations for Online Continual Learning [28.047867978274358]
In particular, we induce neural collapse to form a simplex equiangular tight frame (ETF) structure in the representation space.
We show that our proposed method outperforms state-of-the-art methods by a noticeable margin in various online continual learning scenarios.
arXiv Detail & Related papers (2024-04-02T04:29:01Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks through offline training on large-scale datasets.
To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D.
The data from each industry is treated as an independent task to support continual learning, and it conforms to the real-world long-tail distribution to simulate pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z)
- Self-aware and Cross-sample Prototypical Learning for Semi-supervised Medical Image Segmentation [10.18427897663732]
Consistency learning plays a crucial role in semi-supervised medical image segmentation.
It enables the effective utilization of limited annotated data while leveraging the abundance of unannotated data.
We propose a self-aware and cross-sample prototypical learning method (SCP-Net) to enhance the diversity of prediction in consistency learning.
arXiv Detail & Related papers (2023-05-25T16:22:04Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data [10.006890915441987]
The popularity of self-supervised learning is driven by the fact that traditional models typically require a huge amount of well-annotated data for training.
Self-supervised methods have been introduced to improve training-data efficiency through discriminative pre-training of models.
We aim to provide the first comprehensive review of multimodal self-supervised learning methods for temporal data.
arXiv Detail & Related papers (2022-06-06T04:59:44Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of the vision-language approach of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.