Related papers: Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

URL: http://arxiv.org/abs/2406.19905v2
Date: Mon, 5 Aug 2024 12:12:48 GMT
Title: Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
Authors: Longrong Yang, Dong Shen, Chaoxiang Cai, Fan Yang, Size Li, Di Zhang, Xi Li
Abstract summary: Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs). Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token. This paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC).
Score: 20.979790612689992
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Mixture-of-Experts (MoE) has gained increasing attention in studying Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and they usually employ a router to predict the routing of each token. However, the predictions are based solely on sample features and do not truly reveal the optimization directions of tokens. This may lead to severe optimization interference between different tokens assigned to an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis, i.e., Solving Token Gradient Conflict (STGC). Specifically, we first use token-level gradients to identify conflicting tokens in experts. After that, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.
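The abstract's first step, using token-level gradients to identify conflicting tokens within an expert, can be illustrated with a small sketch. The abstract does not say how conflict is measured, so the criterion below is an assumption: a token counts as conflicting when the cosine similarity between its gradient and the mean gradient of all tokens routed to the same expert falls below a threshold. The function names and the `threshold` parameter are hypothetical and are not taken from the STGC code; the conflict-eliminating loss is not sketched here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def find_conflicting_tokens(token_grads, threshold=0.0):
    """Return indices of tokens whose gradient disagrees with the expert's mean gradient.

    token_grads: one flattened gradient vector per token routed to a single expert.
    ASSUMPTION: "conflict" means cosine similarity to the mean gradient < threshold.
    """
    if not token_grads:
        return []
    dim = len(token_grads[0])
    mean_grad = [sum(g[i] for g in token_grads) / len(token_grads) for i in range(dim)]
    return [i for i, g in enumerate(token_grads) if cosine(g, mean_grad) < threshold]

# Toy gradients for three tokens assigned to the same expert.
grads = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]]
conflicting = find_conflicting_tokens(grads)
print(conflicting)
```

In this toy input the third token's gradient points away from the mean direction, so it is the one flagged. A real implementation would work on per-token gradients of expert parameters from an autodiff framework rather than on hand-written lists.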

Related papers

Token-Level Prompt Mixture with Parameter-Free Routing for Federated Domain Generalization [51.562474873972086]
Federated domain generalization (FedDG) aims to learn a globally generalizable model from decentralized clients with heterogeneous data. Recent studies have introduced prompt learning to adapt vision-language models (VLMs) in FedDG by learning a single global prompt. We propose TRIP, a Token-level prompt mixture with parameter-free routing framework for FedDG.
arXiv Detail & Related papers (2025-04-29T11:06:03Z)
Demystifying Singular Defects in Large Language Models [61.98878352956125]
In large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored. We provide both theoretical insights and empirical validation across a range of recent models. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures.
arXiv Detail & Related papers (2025-02-10T20:09:16Z)
Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning [46.43130011147807]
We argue that tokens serving different roles - specifically, reasoning tokens versus boilerplate tokens - differ significantly in importance and learning complexity. We propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning.
arXiv Detail & Related papers (2024-12-19T12:06:24Z)
Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text and vision-to-text datasets. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z)
Inference Optimal VLMs Need Only One Visual Token but Larger Models [54.01228554126122]
Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks. VLMs are often constrained by high latency during inference due to substantial compute required to process the large number of input tokens. We take some initial steps towards building approaches tailored for high token compression settings.
arXiv Detail & Related papers (2024-11-05T18:54:21Z)
FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step. We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z)
GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory [49.536752342048075]
Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models. We propose a new fine-tuning method, GW-MoE, to address uncertainty in the MoE router.
arXiv Detail & Related papers (2024-06-18T08:03:51Z)
TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction. Alongside TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution. This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models [40.651650382105636]
The vanilla method adds padding tokens to ensure that the number of new tokens remains consistent across samples. We propose a novel method that resolves the issue of different samples accepting different numbers of tokens, without increasing memory or computing overhead. Our proposed method handles inconsistent prediction tokens across samples without adding padding tokens.
arXiv Detail & Related papers (2024-05-13T08:24:21Z)
Hot or Cold? Adaptive Temperature Sampling for Code Generation with Large Language Models [54.72004797421481]
We conduct the first systematic study to explore a decoding strategy specialized in code generation. Inspired by our findings, we propose a simple yet effective method: Adaptive Temperature (AdapT) sampling. Results show that AdapT sampling significantly outperforms the state-of-the-art decoding strategy.
arXiv Detail & Related papers (2023-09-06T06:27:33Z)
Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or few experts (sub-networks) on a per-sample or per-token basis. We show for the first time that pMoE provably reduces the required number of training samples to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z)
On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
arXiv Detail & Related papers (2022-04-20T01:40:19Z)
Mixture-of-Experts with Expert Choice Routing [44.777850078713634]
Prior work allocates a fixed number of experts to each token using a top-k function. We propose a heterogeneous mixture-of-experts employing an expert choice method. Our method improves training convergence time by more than 2x.
arXiv Detail & Related papers (2022-02-18T17:46:11Z)
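
The contrast drawn in the last entry above, top-k selection by tokens versus selection by experts, can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the score matrix, the function names, and the `capacity` parameter are hypothetical, and a non-empty score matrix is assumed.

```python
def token_choice(scores, k):
    """Token-choice routing: each token picks its k highest-scoring experts.

    scores[t][e] is the router affinity of token t for expert e.
    Returns, per token, the list of chosen expert indices.
    """
    return [sorted(range(len(row)), key=lambda e: -row[e])[:k] for row in scores]

def expert_choice(scores, capacity):
    """Expert-choice routing: each expert picks its `capacity` highest-scoring tokens.

    Returns, per expert, the list of chosen token indices.
    """
    n_experts = len(scores[0])
    return [
        sorted(range(len(scores)), key=lambda t: -scores[t][e])[:capacity]
        for e in range(n_experts)
    ]

# Four tokens, two experts (illustrative affinities only).
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
per_token = token_choice(scores, k=1)
per_expert = expert_choice(scores, capacity=2)
print(per_token)
print(per_expert)
```

With these scores, token-choice routing sends three of the four tokens to expert 0, while expert-choice routing gives each expert exactly two tokens. The example shows the load-balancing difference and, as a side effect of expert choice, that the number of experts per token is no longer fixed.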

This list is automatically generated from the titles and abstracts of the papers on this site.