Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training
- URL: http://arxiv.org/abs/2207.12661v1
- Date: Tue, 26 Jul 2022 05:19:16 GMT
- Title: Learning Visual Representation from Modality-Shared Contrastive
Language-Image Pre-training
- Authors: Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen
Xu, Shih-Fu Chang, Lu Yuan
- Abstract summary: We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
- Score: 88.80694147730883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale multi-modal contrastive pre-training has demonstrated great
utility to learn transferable features for a range of downstream tasks by
mapping multiple modalities into a shared embedding space. Typically, this has
employed separate encoders for each modality. However, recent work suggests
that transformers can support learning across multiple modalities and allow
knowledge sharing. Inspired by this, we investigate a variety of
Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
More specifically, we question how many parameters of a transformer model can
be shared across modalities during contrastive pre-training, and rigorously
examine architectural design choices that position the proportion of parameters
shared along a spectrum. In studied conditions, we observe that a mostly
unified encoder for vision and language signals outperforms all other
variations that separate more parameters. Additionally, we find that
light-weight modality-specific parallel modules further improve performance.
Experimental results show that the proposed MS-CLIP approach outperforms
vanilla CLIP by up to 13% relative in zero-shot ImageNet classification
(pre-trained on YFCC-100M), while simultaneously supporting a reduction of
parameters. In addition, our approach outperforms vanilla CLIP by 1.6 points in
linear probing on a collection of 24 downstream vision tasks. Furthermore, we
discover that sharing parameters leads to semantic concepts from different
modalities being encoded more closely in the embedding space, facilitating the
transferring of common semantic structure (e.g., attention patterns) from
language to vision. Code is available at
https://github.com/Hxyou/MSCLIP.
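As a concrete illustration of the idea described in the abstract (transformer weights shared across modalities plus lightweight modality-specific parallel modules, trained with a symmetric image-text contrastive loss), the minimal PyTorch sketch below may help. It is not the authors' MS-CLIP implementation (see the repository linked above for that); the class names, the choice of modality-specific LayerNorms as the lightweight parallel modules, and all dimensions are illustrative assumptions.

```python
# Minimal sketch of a modality-shared contrastive image-text model.
# Assumption: modality-specific LayerNorms stand in for the paper's lightweight
# parallel modules; all sizes and embedders here are illustrative, not MS-CLIP's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedBlock(nn.Module):
    """One transformer block whose attention/MLP weights are reused by both
    modalities, with modality-specific LayerNorms as the only separate parameters."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Lightweight modality-specific (non-shared) normalization layers.
        self.norm1 = nn.ModuleDict({"image": nn.LayerNorm(dim), "text": nn.LayerNorm(dim)})
        self.norm2 = nn.ModuleDict({"image": nn.LayerNorm(dim), "text": nn.LayerNorm(dim)})

    def forward(self, x: torch.Tensor, modality: str) -> torch.Tensor:
        h = self.norm1[modality](x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2[modality](x))
        return x


class ModalitySharedCLIP(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4,
                 vocab: int = 1000, patch_dim: int = 768):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, dim)    # image patch features -> dim
        self.token_embed = nn.Embedding(vocab, dim)     # text token ids -> dim
        self.blocks = nn.ModuleList(SharedBlock(dim, heads) for _ in range(depth))
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), as in CLIP

    def encode(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        x = self.patch_embed(tokens) if modality == "image" else self.token_embed(tokens)
        for blk in self.blocks:                         # same weights for both modalities
            x = blk(x, modality)
        return F.normalize(x.mean(dim=1), dim=-1)       # pooled, unit-norm embedding

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        img = self.encode(image_tokens, "image")
        txt = self.encode(text_tokens, "text")
        logits = self.logit_scale.exp() * (img @ txt.t())        # scaled cosine similarities
        labels = torch.arange(img.size(0), device=img.device)    # matched pairs on the diagonal
        # Symmetric InfoNCE loss over image-to-text and text-to-image directions.
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


if __name__ == "__main__":
    model = ModalitySharedCLIP()
    images = torch.randn(8, 49, 768)         # 8 images, 49 patch features each (illustrative)
    texts = torch.randint(0, 1000, (8, 16))  # 8 captions, 16 token ids each (illustrative)
    print(model(images, texts).item())       # scalar contrastive loss
```

In this sketch only the per-block LayerNorms differ between modalities; every attention and MLP weight is reused by both the image and text paths, which is the mostly-unified parameter-sharing regime the abstract reports as performing best.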
Related papers
- Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling [58.50618448027103]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
This paper explores the differences across various CLIP-trained vision backbones.
The method achieves an increase in accuracy of up to 39.1% over the best single backbone.
arXiv Detail & Related papers (2024-05-27T12:59:35Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module injecting unmasked image embedding into masked text embedding is proposed.
arXiv Detail & Related papers (2023-12-01T15:54:55Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs Parameter-Efficient Fine-Tuning (PEFT) to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
arXiv Detail & Related papers (2023-08-03T04:17:25Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Multi-scale and Cross-scale Contrastive Learning for Semantic Segmentation [5.281694565226513]
We apply contrastive learning to enhance the discriminative power of the multi-scale features extracted by semantic segmentation networks.
By first mapping the encoder's multi-scale representations to a common feature space, we instantiate a novel form of supervised local-global constraint.
arXiv Detail & Related papers (2022-03-25T01:24:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.