Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
- URL: http://arxiv.org/abs/2406.19905v2
- Date: Mon, 5 Aug 2024 12:12:48 GMT
- Title: Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
- Authors: Longrong Yang, Dong Shen, Chaoxiang Cai, Fan Yang, Size Li, Di Zhang, Xi Li,
- Abstract summary: Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs)
Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token.
This paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC)
- Score: 20.979790612689992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token. However, the predictions are based solely on sample features and do not truly reveal the optimization directions of tokens. This may lead to severe optimization interference between different tokens assigned to an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC). Specifically, we first use token-level gradients to identify conflicting tokens in experts. After that, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.
Related papers
- Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks.
VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens.
We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory [49.536752342048075]
Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models.
We propose a new fine-tuning method, GW-MoE, to address this issue.
arXiv Detail & Related papers (2024-06-18T08:03:51Z) - TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
Cooperated with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models [40.651650382105636]
Vanilla method adds padding tokens in order to ensure that the number of new tokens remains consistent across samples.
We propose a novel method that can resolve the issue of inconsistent tokens accepted by different samples without necessitating an increase in memory or computing overhead.
Our proposed method can handle the situation where the prediction tokens of different samples are inconsistent without the need to add padding tokens.
arXiv Detail & Related papers (2024-05-13T08:24:21Z) - Hot or Cold? Adaptive Temperature Sampling for Code Generation with
Large Language Models [54.72004797421481]
We conduct the first systematic study to explore a decoding strategy specialized in code generation.
Inspired by the above findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling.
Results show that AdapT sampling significantly outperforms state-of-the-art decoding strategy.
arXiv Detail & Related papers (2023-09-06T06:27:33Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.