Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture
of Experts
- URL: http://arxiv.org/abs/2206.02770v1
- Date: Mon, 6 Jun 2022 17:51:59 GMT
- Title: Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture
of Experts
- Authors: Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton and
Neil Houlsby
- Abstract summary: We present the Language-Image MoE, LIMoE, a sparse mixture of experts model capable of multimodal learning.
LIMoE accepts both images and text simultaneously, while being trained using a contrastive loss.
Across multiple scales, we demonstrate remarkable performance improvement over dense models of equivalent computational cost.
- Score: 26.041404520616073
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large sparsely-activated models have obtained excellent performance in
multiple domains. However, such models are typically trained on a single
modality at a time. We present the Language-Image MoE, LIMoE, a sparse mixture
of experts model capable of multimodal learning. LIMoE accepts both images and
text simultaneously, while being trained using a contrastive loss. MoEs are a
natural fit for a multimodal backbone, since expert layers can learn an
appropriate partitioning of modalities. However, new challenges arise; in
particular, training stability and balanced expert utilization, for which we
propose an entropy-based regularization scheme. Across multiple scales, we
demonstrate remarkable performance improvement over dense models of equivalent
computational cost. LIMoE-L/16 trained comparably to CLIP-L/14 achieves 78.6%
zero-shot ImageNet accuracy (vs. 76.2%), and when further scaled to H/14 (with
additional data) it achieves 84.1%, comparable to state-of-the-art methods
which use larger custom per-modality backbones and pre-training schemes. We
analyse the quantitative and qualitative behavior of LIMoE, and demonstrate
phenomena such as differing treatment of the modalities and the organic
emergence of modality-specific experts.
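As an illustration of the entropy-based regularization the abstract refers to, below is a minimal sketch (PyTorch; the function names, shapes, and auxiliary-loss weight are illustrative assumptions, and the paper's exact thresholds and per-modality handling are not reproduced). It penalizes high per-token routing entropy (so each token routes confidently) while rewarding high batch-level entropy (so experts are utilized evenly).
```python
# Minimal sketch of an entropy-based routing regularizer in the spirit of the
# scheme described in the abstract. All names and the 0.01 weight are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F

def entropy(p, eps=1e-9):
    # Shannon entropy along the last dimension.
    return -(p * (p + eps).log()).sum(dim=-1)

def routing_entropy_losses(router_logits):
    # router_logits: [num_tokens, num_experts] gating scores for one modality.
    probs = F.softmax(router_logits, dim=-1)
    local = entropy(probs).mean()          # low  => each token routes confidently
    balance = entropy(probs.mean(dim=0))   # high => experts are used evenly overall
    return local, balance

# Example: 8 tokens routed over 4 experts.
logits = torch.randn(8, 4)
local_h, global_h = routing_entropy_losses(logits)
aux_loss = 0.01 * (local_h - global_h)     # minimize local entropy, maximize global entropy
```
In the paper these auxiliary terms accompany the usual image-text contrastive objective; the sketch only shows the routing-entropy part.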
Related papers
- A Simple Approach to Unifying Diffusion-based Conditional Generation [63.389616350290595]
We introduce a simple, unified framework to handle diverse conditional generation tasks.
Our approach enables versatile capabilities via different inference-time sampling schemes.
Our model supports additional capabilities like non-spatially aligned and coarse conditioning.
arXiv Detail & Related papers (2024-10-15T09:41:43Z)
- HMoE: Heterogeneous Mixture of Experts for Language Modeling [45.65121689677227]
Traditionally, Mixture of Experts (MoE) models use homogeneous experts, each with identical capacity.
We propose a novel Heterogeneous Mixture of Experts (HMoE) where experts differ in size and thus possess diverse capacities.
HMoE achieves lower loss with fewer activated parameters and outperforms conventional homogeneous MoE models on various pre-training evaluation benchmarks.
arXiv Detail & Related papers (2024-08-20T09:35:24Z)
- DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception [66.88792390480343]
We propose DEEM, a simple but effective approach that utilizes the generative feedback of diffusion models to align the semantic distributions of the image encoder.
DEEM exhibits enhanced robustness and a superior capacity to alleviate model hallucinations while utilizing fewer trainable parameters, less pre-training data, and a smaller base model size.
arXiv Detail & Related papers (2024-05-24T05:46:04Z)
- Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding; a minimal sketch of this self-contrast idea appears after this list.
arXiv Detail & Related papers (2024-05-23T12:45:29Z)
- Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning [50.73666458313015]
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications.
MoE has emerged as a promising solution, with its sparse architecture enabling effective task decoupling.
Intuition-MoR1E achieves superior efficiency and a 2.15% overall accuracy improvement across 14 public datasets.
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- Mixture of Low-rank Experts for Transferable AI-Generated Image Detection [18.631006488565664]
Generative models have made a giant leap in producing photo-realistic images with minimal expertise, sparking concerns about the authenticity of online information.
This study aims to develop a universal AI-generated image detector capable of identifying images from diverse sources.
Inspired by the zero-shot transferability of pre-trained vision-language models, we seek to harness the non-trivial visual-world knowledge and descriptive proficiency of CLIP-ViT to generalize over unknown domains.
arXiv Detail & Related papers (2024-04-07T09:01:50Z)
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts.
arXiv Detail & Related papers (2023-10-15T07:20:28Z)
- Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
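As referenced in the SCMoE entry above, the following is a minimal sketch of contrasting a stronger and a weaker routing of the same MoE model at inference time, in the style of contrastive decoding. The moe_forward interface, the toy stand-in model, and the contrast weight beta are assumptions for illustration, not the authors' implementation.
```python
# Minimal sketch of self-contrast decoding with an MoE model, loosely
# following the SCMoE idea above. The interface and toy model are
# illustrative assumptions.
import torch

def self_contrast_logits(moe_forward, tokens, beta=0.5):
    strong = moe_forward(tokens, top_k=2)  # the routing the model would normally use
    weak = moe_forward(tokens, top_k=1)    # a deliberately weaker routing, same weights
    return strong - beta * weak            # amplify what the stronger routing adds

# Toy stand-in so the sketch runs end to end (not a real learned router).
def toy_moe(tokens, top_k):
    torch.manual_seed(0)
    experts = torch.randn(4, tokens.shape[-1], 10)       # 4 experts, vocabulary of 10
    outs = torch.einsum("bd,edv->bev", tokens, experts)  # logits from every expert
    return outs[:, :top_k].mean(dim=1)                   # keep only the first top_k experts

tokens = torch.randn(1, 16)
next_token = self_contrast_logits(toy_moe, tokens).argmax(dim=-1)
```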
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.