Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
- URL: http://arxiv.org/abs/2509.10530v1
- Date: Fri, 05 Sep 2025 02:49:15 GMT
- Title: Dynamic Adaptive Shared Experts with Grouped Multi-Head Attention Mixture of Experts
- Authors: Cheng Li, Jiexiong Liu, Yixuan Chen, Jie Ji
- Abstract summary: We propose a Dynamic Adaptive Shared Expert and Grouped Multi-Head Attention Hybrid Model (DASG-MoE) to enhance long-sequence modeling capabilities. First, we employ the Grouped Multi-Head Attention (GMHA) mechanism to effectively reduce the computational complexity of long sequences. Second, we design a Dual-Scale Shared Expert Structure (DSSE), where shallow experts use lightweight computations to quickly respond to low-dimensional features. Third, we propose a hierarchical Adaptive Dynamic Routing (ADR) mechanism that dynamically selects expert levels based on feature complexity and task requirements.
- Score: 10.204413386807564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models based on the Mixture of Experts (MoE) architecture have made significant progress in long-sequence modeling, but existing models still have shortcomings in computational efficiency and the ability to capture long-range dependencies, especially in terms of the dynamic adaptability of expert resource allocation. In this paper, we propose a Dynamic Adaptive Shared Expert and Grouped Multi-Head Attention Hybrid Model (DASG-MoE) to enhance long-sequence modeling capabilities by integrating three modules. First, we employ the Grouped Multi-Head Attention (GMHA) mechanism to effectively reduce the computational complexity of long sequences. By parallel processing through sequence grouping, local sliding window attention, and feature aggregation, we address long-range dependency issues and the model's lack of generalization for local information. Second, we design a Dual-Scale Shared Expert Structure (DSSE), where shallow experts use lightweight computations to quickly respond to low-dimensional features, while deep experts process high-dimensional complex semantics through pre-training transfer and post-training optimization, achieving a dynamic balance between efficiency and accuracy. Third, we propose a hierarchical Adaptive Dynamic Routing (ADR) mechanism that dynamically selects expert levels based on feature complexity and task requirements, and optimizes resource allocation through a local expert activation strategy. Experiments on multiple long-sequence benchmark datasets demonstrate that our DASG-MoE model outperforms state-of-the-art models.
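The abstract does not come with an implementation, but the GMHA idea above (sequence grouping plus local sliding-window attention) can be illustrated with a short PyTorch sketch. The module name, group size, padding scheme, and the omission of the paper's feature-aggregation step are all assumptions made for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class GroupedWindowAttention(nn.Module):
    """Illustrative sketch: split the sequence into fixed-size groups and
    run standard multi-head attention within each group (a local window).
    Group size and zero-padding are assumptions, not the paper's design."""

    def __init__(self, d_model: int, n_heads: int, group_size: int):
        super().__init__()
        self.group_size = group_size
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        pad = (-t) % self.group_size                 # pad so groups divide evenly
        if pad:
            x = torch.cat([x, x.new_zeros(b, pad, d)], dim=1)
        g = x.shape[1] // self.group_size
        # (b, t, d) -> (b*g, group_size, d): each group attends only locally,
        # so cost scales with the group size rather than the full sequence.
        xg = x.reshape(b * g, self.group_size, d)
        out, _ = self.attn(xg, xg, xg)
        return out.reshape(b, g * self.group_size, d)[:, :t]  # drop padding

x = torch.randn(2, 100, 64)
print(GroupedWindowAttention(64, 8, group_size=16)(x).shape)  # (2, 100, 64)
```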
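In the same spirit, the dual-scale experts with adaptive routing (DSSE plus ADR) can be caricatured as a learned gate that scores per-token complexity, sending easy tokens through a lightweight shallow expert and hard tokens through a deeper one. The gate design, threshold, and expert shapes below are assumptions; a faithful implementation would also dispatch only the selected tokens instead of computing both paths.

```python
import torch
import torch.nn as nn

class DualScaleMoE(nn.Module):
    """Illustrative sketch: route each token to a shallow or a deep expert
    according to a scalar complexity gate (the threshold is an assumption)."""

    def __init__(self, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.threshold = threshold
        self.gate = nn.Linear(d_model, 1)           # per-token complexity score
        self.shallow = nn.Linear(d_model, d_model)  # fast, low-capacity expert
        self.deep = nn.Sequential(                  # slower, high-capacity expert
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        score = torch.sigmoid(self.gate(x))                  # (b, t, 1)
        use_deep = (score > self.threshold).float()
        # For clarity both experts run on all tokens; an efficient version
        # would gather only the tokens routed to each expert.
        return use_deep * self.deep(x) + (1.0 - use_deep) * self.shallow(x)

x = torch.randn(2, 100, 64)
print(DualScaleMoE(64)(x).shape)  # torch.Size([2, 100, 64])
```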
Related papers
- MixFormer: Co-Scaling Up Dense and Sequence in Industrial Recommenders [11.566232697512879]
MixFormer is a unified Transformer-style architecture tailored for recommender systems. It jointly models sequential behaviors and feature interactions within a single backbone. Experiments on large-scale industrial datasets demonstrate that MixFormer consistently exhibits superior accuracy and efficiency.
arXiv Detail & Related papers (2026-02-15T11:53:30Z) - Fine-Grained Model Merging via Modular Expert Recombination [33.253051407398836]
We propose MERGE, a method that enables component-wise model merging and input-aware, on-demand module recombination at inference. MERGE formulates component-wise merging as a bi-objective optimization problem that balances cross-task performance and storage efficiency. We show that MERGE consistently outperforms strong baselines and generalizes effectively.
arXiv Detail & Related papers (2026-02-06T09:55:56Z) - MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling [60.648359990090846]
State-space models (SSMs) have recently attracted attention as an efficient alternative to computationally expensive attention-based models for sequence modeling. This paper introduces a multi-scale SSM framework that represents sequence dynamics across multiple resolutions and processes each resolution with specialized state-space dynamics.
arXiv Detail & Related papers (2025-12-29T19:36:28Z) - H3M-SSMoEs: Hypergraph-based Multimodal Learning with LLM Reasoning and Style-Structured Mixture of Experts [4.041490867852946]
H3M-SSMoEs is a novel Hypergraph-based MultiModal architecture with LLM reasoning and a Style-Structured Mixture of Experts. Experiments on three major stock markets demonstrate that H3M-SSMoEs surpasses state-of-the-art methods in both predictive accuracy and investment performance, while exhibiting effective risk control.
arXiv Detail & Related papers (2025-10-29T01:54:52Z) - MGTS-Net: Exploring Graph-Enhanced Multimodal Fusion for Augmented Time Series Forecasting [1.7077661158850292]
We propose MGTS-Net, a Multimodal Graph-enhanced Network for Time Series forecasting. The model consists of three core components: (1) a Multimodal Feature Extraction layer (MFE), (2) a Multimodal Feature Fusion layer (MFF), and (3) a Multi-Scale Prediction layer (MSP).
arXiv Detail & Related papers (2025-10-18T04:47:10Z) - MICACL: Multi-Instance Category-Aware Contrastive Learning for Long-Tailed Dynamic Facial Expression Recognition [12.538204312275935]
We propose a novel multi-instance learning framework, Multi-Instance Category-Aware Contrastive Learning (MICACL). MICACL balances training between major and minor categories. Experiments on in-the-wild datasets demonstrate that MICACL achieves state-of-the-art performance with superior generalization.
arXiv Detail & Related papers (2025-09-04T16:03:14Z) - DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting [14.176801586961286]
Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. We propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with an Exponential Multi-Scale Patch Decomposition block (EMPD), a Triad Interaction Block (TIB), and an Adaptive Scale Routing MoE block (ASR-MoE). EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer's decomposed representations.
arXiv Detail & Related papers (2025-08-03T13:11:52Z) - FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z) - Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z) - Comparative Analysis of AI Agent Architectures for Entity Relationship Classification [1.6887793771613606]
In this study, we conduct a comparative analysis of three distinct AI agent architectures to perform relation classification. The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting.
arXiv Detail & Related papers (2025-06-03T04:19:47Z) - STAR-Rec: Making Peace with Length Variance and Pattern Diversity in Sequential Recommendation [61.320991769685065]
STAR-Rec is a novel architecture that combines preference-aware attention and state-space modeling. We show that STAR-Rec consistently outperforms state-of-the-art sequential recommendation methods.
arXiv Detail & Related papers (2025-05-06T12:40:38Z) - A Deep Learning Framework for Sequence Mining with Bidirectional LSTM and Multi-Scale Attention [11.999319439383918]
This paper addresses the challenges of mining latent patterns and modeling contextual dependencies in complex sequence data. A sequence pattern mining algorithm is proposed by integrating Bidirectional Long Short-Term Memory (BiLSTM) with a multi-scale attention mechanism. BiLSTM captures both forward and backward dependencies in sequences, enhancing the model's ability to perceive global contextual structures (see the BiLSTM sketch after this list).
arXiv Detail & Related papers (2025-04-21T16:53:02Z) - Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts for each input based on the router's confidence in its expert selection (see the routing sketch after this list).
arXiv Detail & Related papers (2024-03-12T13:41:15Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model that improves classification capacity for multivariate time series.
It exhibits three aspects of merit: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
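For the BiLSTM-with-multi-scale-attention entry above, a minimal sketch of the pairing: a bidirectional LSTM encodes forward and backward context, and attention is pooled over windows of several sizes. Layer sizes, the scale set, and the pooling scheme are assumptions, not that paper's design.

```python
import torch
import torch.nn as nn

class BiLSTMMultiScaleAttention(nn.Module):
    """Sketch: BiLSTM encoder with attention pooled at several window scales."""

    def __init__(self, d_in: int, d_hidden: int, scales=(4, 8, 16)):
        super().__init__()
        self.scales = scales
        self.lstm = nn.LSTM(d_in, d_hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * d_hidden, 1)    # per-token attention logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(x)                        # (b, t, 2*d_hidden)
        w = self.score(h)                          # (b, t, 1)
        pooled = []
        for s in self.scales:
            t = (h.shape[1] // s) * s              # truncate to a multiple of s
            hs = h[:, :t].reshape(h.shape[0], -1, s, h.shape[-1])
            ws = torch.softmax(w[:, :t].reshape(h.shape[0], -1, s, 1), dim=2)
            pooled.append((ws * hs).sum(dim=2).mean(dim=1))  # (b, 2*d_hidden)
        return torch.stack(pooled).mean(dim=0)     # average across scales

x = torch.randn(2, 100, 32)
print(BiLSTMMultiScaleAttention(32, 64)(x).shape)  # torch.Size([2, 128])
```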
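And for the "Harder Tasks Need More Experts" entry, confidence-based dynamic routing can be sketched as top-p expert selection: each token activates the smallest set of experts whose cumulative routing probability clears a threshold, so low-confidence (harder) tokens recruit more experts. The threshold value and the exact selection rule are assumptions about that paper's method.

```python
import torch

def confidence_routing(logits: torch.Tensor, p: float = 0.6) -> torch.Tensor:
    """Sketch: per token, keep the smallest expert set whose cumulative
    routing probability exceeds p. logits: (tokens, n_experts)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_p, idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # Always keep the top expert; keep each next one while the cumulative
    # probability so far is still below the confidence threshold p.
    keep = torch.cat(
        [torch.ones_like(cum[:, :1], dtype=torch.bool), cum[:, :-1] < p], dim=-1
    )
    # Scatter the sorted keep-mask back to the original expert order.
    return torch.zeros_like(probs).scatter(-1, idx, keep.float()).bool()

logits = torch.randn(4, 8)            # 4 tokens, 8 experts
mask = confidence_routing(logits)
print(mask.sum(dim=-1))               # active experts per token varies
```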
This list is automatically generated from the titles and abstracts of the papers on this site.