Transferring Pre-trained Multimodal Representations with Cross-modal
Similarity Matching
- URL: http://arxiv.org/abs/2301.02903v1
- Date: Sat, 7 Jan 2023 17:24:11 GMT
- Title: Transferring Pre-trained Multimodal Representations with Cross-modal
Similarity Matching
- Authors: Byoungjip Kim, Sungik Choi, Dasol Hwang, Moontae Lee, Honglak Lee
- Abstract summary: In this paper, we propose a method that can effectively transfer the representations of a large pre-trained multimodal model into a small target model.
For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model.
To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts.
- Score: 49.730741713652435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite surprising performance on zero-shot transfer, pre-training a
large-scale multimodal model is often prohibitive as it requires a huge amount
of data and computing resources. In this paper, we propose a method (BeamCLIP)
that can effectively transfer the representations of a large pre-trained
multimodal model (CLIP-ViT) into a small target model (e.g., ResNet-18). For
unsupervised transfer, we introduce cross-modal similarity matching (CSM) that
enables a student model to learn the representations of a teacher model by
matching the relative similarity distribution across text prompt embeddings. To
better encode the text prompts, we design context-based prompt augmentation
(CPA) that can alleviate the lexical ambiguity of input text prompts. Our
experiments show that unsupervised representation transfer of a pre-trained
vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K
top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning
(SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with
supervised learning (69.8%).
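As a concrete illustration of the CSM objective, the sketch below shows how a student image encoder could be trained to match the teacher's relative similarity distribution over a fixed set of text prompt embeddings. It assumes a frozen CLIP-style teacher, a student projection head that maps into the teacher's embedding space, KL divergence as the matching loss, and a softmax temperature `tau`; these choices are illustrative and not taken from the paper's reference implementation.
```python
import torch
import torch.nn.functional as F

def csm_loss(student_img_emb, teacher_img_emb, text_emb, tau=0.04):
    """Cross-modal similarity matching (sketch).

    student_img_emb: (B, D) student image embeddings (e.g., ResNet-18 + projection head)
    teacher_img_emb: (B, D) frozen teacher image embeddings (e.g., CLIP-ViT)
    text_emb:        (K, D) text prompt embeddings from the teacher's text encoder
    tau:             softmax temperature (an assumed hyperparameter)
    """
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_img_emb, dim=-1)
    p = F.normalize(text_emb, dim=-1)

    # Relative similarity distributions over the K text prompt embeddings.
    teacher_dist = F.softmax(t @ p.T / tau, dim=-1)      # (B, K), target distribution
    student_logp = F.log_softmax(s @ p.T / tau, dim=-1)  # (B, K), student log-distribution

    # The student matches the teacher's distribution via KL divergence.
    return F.kl_div(student_logp, teacher_dist, reduction="batchmean")
```
In a full pipeline, `text_emb` would come from encoding class prompts (possibly context-augmented, in the spirit of CPA) with the teacher's text encoder, and both the teacher and the text embeddings would stay frozen while only the student is updated.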
Related papers
- ScaleKD: Strong Vision Transformers Could Be Excellent Teachers [15.446480934024652]
We present a simple and effective knowledge distillation method, called ScaleKD.
Our method can train student backbones spanning a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets.
When scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties.
arXiv Detail & Related papers (2024-11-11T08:25:21Z)
- Robust Multimodal Learning via Representation Decoupling [6.7678581401558295]
Multimodal learning has attracted increasing attention due to its practicality.
Existing methods tend to address this problem by learning a common subspace representation for different modality combinations.
We propose a novel Decoupled Multimodal Representation Network (DMRNet) to assist robust multimodal learning.
arXiv Detail & Related papers (2024-07-05T12:09:33Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that splits the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the large data requirements of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results in low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks [60.46473247205654]
Using large-scale unsupervised unimodal models for pre-training can enhance the zero-shot performance of image-text pair models.
Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2023-06-07T18:26:22Z)
- Black Box Few-Shot Adaptation for Vision-Language models [41.49584259596654]
Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners.
We describe a black-box method for V-L few-shot adaptation that operates on pre-computed image and text features.
We propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain.
arXiv Detail & Related papers (2023-04-04T12:42:29Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- LiT: Zero-Shot Transfer with Locked-image Text Tuning [68.78877201319811]
"Locked-image Text tuning" (LiT-tuning) teaches a text model to read out good representations from a pre-trained image model for new tasks.
A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval.
arXiv Detail & Related papers (2021-11-15T18:53:48Z)
- WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training [71.37731379031487]
We propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework.
Unlike OpenAI's CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest MoCo method to the cross-modal scenario.
By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources.
arXiv Detail & Related papers (2021-03-11T09:39:49Z)
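To make the queue-based dictionary idea concrete, the following PyTorch sketch shows a MoCo-style cross-modal InfoNCE step in which image queries are scored against their paired text keys plus a queue of previously seen text embeddings; the queue size, temperature, and update rule are illustrative assumptions rather than BriVL's actual configuration.
```python
import torch
import torch.nn.functional as F

class CrossModalQueue:
    """Queue-based dictionary of negative text embeddings (MoCo-style sketch)."""

    def __init__(self, dim=256, size=16384):
        self.queue = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):
        # Replace the oldest entries with the newest momentum-encoder text keys.
        n = keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.queue.shape[0]
        self.queue[idx] = F.normalize(keys, dim=-1)
        self.ptr = (self.ptr + n) % self.queue.shape[0]

def cross_modal_info_nce(img_q, txt_k, queue, tau=0.07):
    """Image queries vs. paired text keys (positives) plus queued negatives."""
    img_q = F.normalize(img_q, dim=-1)
    txt_k = F.normalize(txt_k, dim=-1)
    pos = (img_q * txt_k).sum(dim=-1, keepdim=True)  # (B, 1) positive logits
    neg = img_q @ queue.queue.T                      # (B, size) negative logits
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(img_q.shape[0], dtype=torch.long)  # positives at index 0
    return F.cross_entropy(logits, labels)
```
A symmetric text-to-image term and a momentum-updated key encoder would typically complete such a setup; they are omitted here for brevity.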