Related papers: CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

URL: http://arxiv.org/abs/2409.19291v2
Date: Wed, 2 Oct 2024 21:50:58 GMT
Title: CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Authors: Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng,
Abstract summary: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. DMU efficiently fine-tunes a series of CLIP models that capture different feature spaces. Experiments demonstrate the significant performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks.
Score: 21.734200158914476
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies have identified that the information loss in the CLIP encoding process is substantial, and CLIP tends to capture only coarse-grained features from the input. This deficiency significantly limits the ability of a single CLIP model to handle images rich in visual detail. In this work, we propose a simple yet effective model-agnostic strategy, Diversified Multiplet Upcycling (DMU), for CLIP. DMU efficiently fine-tunes a series of CLIP models that capture different feature spaces, from a dense pre-trained CLIP checkpoint, sharing parameters except for the Feed-Forward Network (FFN). These models can then be transformed into a CLIP-MoE with a larger model capacity, leading to significantly enhanced performance with minimal computational overhead. To the best of our knowledge, Diversified Multiplet Upcycling is the first approach to introduce sparsely activated MoE into CLIP foundation models. Extensive experiments demonstrate the significant performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks, and downstream Multimodal Large Language Model (MLLM) benchmarks by serving as a vision encoder. Furthermore, Diversified Multiplet Upcycling enables the conversion of any dense CLIP model into CLIP-MoEs, which can seamlessly replace CLIP in a plug-and-play manner without requiring further adaptation in downstream frameworks. Through Diversified Multiplet Upcycling, we aim to provide valuable insights for future research on developing more efficient and effective multimodal learning systems.

Related papers

un$^2$CLIP: Improving CLIP's Visual Detail Capturing Ability via Inverting unCLIP [75.19266107565109]
Contrastive Language-Image Pre-training (CLIP) has become a foundation model and has been applied to various vision and multimodal tasks.<n>This work focuses on improving existing CLIP models, aiming to capture as many visual details in images as possible.
arXiv Detail & Related papers (2025-05-30T12:29:38Z)
DiffCLIP: Differential Attention Meets CLIP [57.396578974401734]
We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks.
arXiv Detail & Related papers (2025-03-09T14:04:09Z)
CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling [21.65268178160724]
Mixture-of-Experts (MoE) models are crucial for scaling model capacity while controlling inference costs. We propose CLIP-Upcycling (CLIP-UP), an efficient alternative training strategy that converts a pre-trained dense CLIP model into a sparse MoE architecture.
arXiv Detail & Related papers (2025-02-03T00:04:50Z)
Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones. In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining. Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z)
LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation [60.02145113467427]
This work introduces a fine-tuning approach that integrates large language models with the pretrained CLIP visual encoder. To address the challenge of LLMs' autoregressive nature, we propose a caption-to-caption contrastive learning framework. Our method achieves substantial performance gains on various downstream tasks.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP [56.199779065855004]
We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations. Experiments on the CIFAR-100 and Flickr30K datasets demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples.
arXiv Detail & Related papers (2024-10-30T17:51:31Z)
Multi-Modal Adapter for Vision-Language Models [5.040884755454258]
We propose Multi-Modal Adapter, an approach for Multi-Modal adaptation of CLIP. We add a trainable Multi-Head Attention layer that combines text and image features to produce an additive adaptation of both.
arXiv Detail & Related papers (2024-09-03T12:47:08Z)
Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities. CLIP has severe visual shortcomings, such as which can hardly distinguish orientation, quantity, color, structure. We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z)
Multimodal CLIP Inference for Meta-Few-Shot Image Classification [0.0]
Multimodal foundation models like CLIP learn a joint (image, text) embedding. This study demonstrates that combining modalities from CLIP's text and image encoders outperforms state-of-the-art meta-few-shot learners on widely adopted benchmarks.
arXiv Detail & Related papers (2024-03-26T17:47:54Z)
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts. Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples. We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
CLIP-guided Prototype Modulating for Few-shot Action Recognition [49.11385095278407]
This work aims to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation issue. We present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of a video-text contrastive objective and a prototype modulation.
arXiv Detail & Related papers (2023-03-06T09:17:47Z)
CLIPood: Generalizing CLIP to Out-of-Distributions [73.86353105017076]
Contrastive language-image pre-training (CLIP) models have shown impressive zero-shot ability, but the further adaptation of CLIP on downstream tasks undesirably degrades OOD performances. We propose CLIPood, a fine-tuning method that can adapt CLIP models to OOD situations where both domain shifts and open classes may occur on unseen test data. Experiments on diverse datasets with different OOD scenarios show that CLIPood consistently outperforms existing generalization techniques.
arXiv Detail & Related papers (2023-02-02T04:27:54Z)
CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training [121.46758260964114]
Pre-training across 3D vision and language remains under development because of limited training data. Recent works attempt to transfer vision-language pre-training models to 3D vision. PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification. We propose CLIP2Point, an image-depth pre-training method by contrastive learning to transfer CLIP to the 3D domain.
arXiv Detail & Related papers (2022-10-03T16:13:14Z)
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
Personalizing Pre-trained Models [23.145974171912414]
We consider how upstream pretrained models can be leveraged for downstream few-shot, multilabel, and continual learning tasks. Our model CLIPPER (CLIP PERsonalized) uses image representations from CLIP, a large-scale image representation learning model trained using weak natural language supervision.
arXiv Detail & Related papers (2021-06-02T22:58:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.