CoCoAFusE: Beyond Mixtures of Experts via Model Fusion
- URL: http://arxiv.org/abs/2505.01105v1
- Date: Fri, 02 May 2025 08:35:04 GMT
- Title: CoCoAFusE: Beyond Mixtures of Experts via Model Fusion
- Authors: Aurelio Raffa Ugolini, Mara Tanelli, Valentina Breschi
- Abstract summary: CoCoAFusE builds on the philosophy behind Mixtures of Experts (MoEs). Our formulation extends that of a classical Mixture of Experts by contemplating the fusion of the experts' distributions. This new approach is showcased extensively on a suite of motivating numerical examples and a collection of real-data ones.
- Score: 3.501882879116058
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many learning problems involve multiple patterns and varying degrees of uncertainty dependent on the covariates. Advances in Deep Learning (DL) have addressed these issues by learning highly nonlinear input-output dependencies. However, model interpretability and Uncertainty Quantification (UQ) have often lagged behind. In this context, we introduce the Competitive/Collaborative Fusion of Experts (CoCoAFusE), a novel, Bayesian Covariates-Dependent Modeling technique. CoCoAFusE builds on the very philosophy behind Mixtures of Experts (MoEs), blending predictions from several simple sub-models (or "experts") to achieve high levels of expressiveness while retaining a substantial degree of local interpretability. Our formulation extends that of a classical Mixture of Experts by contemplating the fusion of the experts' distributions in addition to their more usual mixing (i.e., superimposition). Through this additional feature, CoCoAFusE better accommodates different scenarios for the intermediate behavior between generating mechanisms, resulting in tighter credible bounds on the response variable. Indeed, only resorting to mixing, as in classical MoEs, may lead to multimodality artifacts, especially over smooth transitions. Instead, CoCoAFusE can avoid these artifacts even under the same structure and priors for the experts, leading to greater expressiveness and flexibility in modeling. This new approach is showcased extensively on a suite of motivating numerical examples and a collection of real-data ones, demonstrating its efficacy in tackling complex regression problems where uncertainty is a key quantity of interest.
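To make the distinction between mixing and fusing concrete, the following minimal sketch (an illustration only, not the paper's exact formulation; the weights, means, and variances are arbitrary choices) contrasts a two-Gaussian mixture, which turns bimodal between the experts' means, with a moment-matched fused Gaussian that stays unimodal:

```python
import numpy as np
from scipy.stats import norm

# Two Gaussian "experts" predicting the response at a covariate value inside
# a transition region (illustrative parameters, not from the paper).
w = np.array([0.5, 0.5])          # gating / fusion weights
mu = np.array([-1.0, 1.0])        # expert means
sigma = np.array([0.3, 0.3])      # expert standard deviations

y = np.linspace(-3, 3, 601)

# Classical MoE: mix (superimpose) the expert densities -> possibly bimodal.
mixture_pdf = sum(w_k * norm.pdf(y, m_k, s_k) for w_k, m_k, s_k in zip(w, mu, sigma))

# One simple fusion rule: moment matching to a single Gaussian -> unimodal.
mu_fused = np.sum(w * mu)
var_fused = np.sum(w * (sigma**2 + mu**2)) - mu_fused**2  # matches the mixture variance
fused_pdf = norm.pdf(y, mu_fused, np.sqrt(var_fused))

print("mixture modes:", (np.diff(np.sign(np.diff(mixture_pdf))) < 0).sum())  # 2
print("fused modes:  ", (np.diff(np.sign(np.diff(fused_pdf))) < 0).sum())    # 1
```

CoCoAFusE's fusion rule is learned and covariate-dependent; the moment-matching step above merely illustrates why fusing densities can avoid the multimodality artifacts that plain superimposition produces over smooth transitions.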
Related papers
- Drift-aware Collaborative Assistance Mixture of Experts for Heterogeneous Multistream Learning [31.877595633244734]
Learning from multiple data streams in real-world scenarios is fundamentally challenging due to intrinsic heterogeneity and unpredictable concept drifts. Existing methods typically assume homogeneous streams and employ static architectures with indiscriminate knowledge fusion. We propose CAMEL, a framework that assigns each stream an independent system with a dedicated feature extractor and task-specific head. Furthermore, we propose an Autonomous Expert Tuner (AET) strategy, which dynamically manages expert lifecycles in response to drift.
arXiv Detail & Related papers (2025-08-03T05:35:34Z)
- CoCoA-Mix: Confusion-and-Confidence-Aware Mixture Model for Context Optimization [9.888839721140231]
We propose a confusion-aware loss (CoA-loss) that improves specialization by refining the decision boundaries between confusing classes. We mathematically demonstrate that a mixture model can enhance generalization without compromising specialization. CoCoA-Mix, a mixture model with CoA-loss and CoA-weights, outperforms state-of-the-art methods by enhancing specialization and generalization.
arXiv Detail & Related papers (2025-06-09T07:04:47Z)
- Enhancing CTR Prediction with De-correlated Expert Networks [53.05653547330796]
We propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert correlations. Extensive experiments have been conducted to validate the effectiveness of D-MoE and the de-correlation principle.
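The abstract does not spell out the form of the Cross-Expert De-Correlation loss; the snippet below is a hedged sketch of one natural choice, a penalty on the pairwise correlations between expert representations (the tensor shapes and the name `decorrelation_loss` are assumptions for illustration):

```python
import torch

def decorrelation_loss(expert_outputs: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise correlation between expert outputs.

    expert_outputs: (num_experts, batch, dim) -- one representation per expert.
    Illustrative penalty, not necessarily the loss used in D-MoE.
    """
    E, B, D = expert_outputs.shape
    flat = expert_outputs.reshape(E, B * D)
    flat = flat - flat.mean(dim=1, keepdim=True)          # center each expert
    flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-8)  # unit-normalize
    corr = flat @ flat.T                                   # (E, E) correlation matrix
    off_diag = corr - torch.eye(E, device=corr.device)
    return (off_diag ** 2).sum() / (E * (E - 1))

# Example: 4 experts, batch of 8, 16-dimensional outputs.
loss = decorrelation_loss(torch.randn(4, 8, 16))
```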
arXiv Detail & Related papers (2025-05-23T14:04:38Z)
- CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning [5.161314094237747]
We propose Contrastive Representation for MoE (CoMoE) to promote modularization and specialization in MoE. Experiments on several benchmarks and in multi-task settings demonstrate that CoMoE can consistently enhance MoE's capacity and promote modularization among the experts.
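As a rough illustration of a contrastive objective over expert representations (a generic InfoNCE-style stand-in, not necessarily CoMoE's actual loss), one can pull together representations routed through the same expert and push apart those from different experts:

```python
import torch
import torch.nn.functional as F

def expert_contrastive_loss(reps: torch.Tensor, expert_ids: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """Generic InfoNCE-style loss over expert representations.

    reps:       (N, dim) representations produced by the MoE layer.
    expert_ids: (N,) index of the expert that produced each representation.
    Same-expert pairs are positives, different-expert pairs are negatives --
    an illustrative stand-in, since the abstract does not give CoMoE's exact form.
    """
    reps = F.normalize(reps, dim=1)
    sim = reps @ reps.T / temperature                         # (N, N) similarities
    same = expert_ids.unsqueeze(0) == expert_ids.unsqueeze(1)
    eye = torch.eye(len(reps), dtype=torch.bool, device=reps.device)
    pos = same & ~eye
    log_prob = sim - torch.logsumexp(sim.masked_fill(eye, -1e9), dim=1, keepdim=True)
    return -(log_prob[pos]).mean()

loss = expert_contrastive_loss(torch.randn(32, 64), torch.randint(0, 4, (32,)))
```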
arXiv Detail & Related papers (2025-05-23T06:58:44Z)
- A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System [9.764336669208394]
Generative models, such as GPT and BERT, have significantly improved performance in tasks like text generation and summarization. However, hallucinations (where models generate non-factual or misleading content) are especially problematic in smaller-scale architectures. We propose a unified Virtual Mixture-of-Experts (MoE) fusion strategy that enhances inference performance and mitigates hallucinations in a single Qwen 1.5 0.5B model.
arXiv Detail & Related papers (2025-04-01T11:38:01Z)
- Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism, which determines the relevance of each expert to a given input and then dynamically assigns experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under the MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
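The softmax gating mechanism itself can be summarized in a few lines: a learned linear map scores every expert for a given input, and the softmax of those scores provides the (dense) mixture weights. A minimal sketch, with illustrative module and dimension names:

```python
import torch
import torch.nn as nn

class SoftmaxGatedMoE(nn.Module):
    """Dense softmax-gated MoE: every expert is evaluated and weighted."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # scores each expert per input
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)               # (batch, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)

y = SoftmaxGatedMoE(dim=16, num_experts=4)(torch.randn(8, 16))
```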
arXiv Detail & Related papers (2025-03-05T06:11:24Z)
- Complexity Experts are Task-Discriminative Learners for Any Image Restoration [80.46313715427928]
We introduce "complexity experts" -- flexible expert blocks with varying computational complexity and receptive fields. This preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability.
arXiv Detail & Related papers (2024-11-27T15:58:07Z)
- Retraining-Free Merging of Sparse MoE via Hierarchical Clustering [14.858134039539697]
This paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. We provide theoretical analysis and evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral.
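As a hedged sketch of retraining-free expert merging (the clustering feature, here flattened weight matrices, and the plain-averaging merge rule are assumptions; HC-SMoE defines its own grouping criterion), one can agglomeratively cluster experts and average each cluster:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_experts(expert_weights: list[np.ndarray], num_groups: int) -> list[np.ndarray]:
    """Merge experts via hierarchical (agglomerative) clustering.

    Each expert is represented by its flattened weight matrix; experts whose
    weights are close are averaged into one merged expert. Both the feature
    choice and the merge rule are illustrative, not HC-SMoE's own criterion.
    """
    feats = np.stack([w.ravel() for w in expert_weights])            # (E, P)
    labels = fcluster(linkage(feats, method="average"),
                      t=num_groups, criterion="maxclust")
    merged = []
    for g in sorted(set(labels)):
        members = [w for w, lab in zip(expert_weights, labels) if lab == g]
        merged.append(np.mean(members, axis=0))
    return merged

# Example: compress 8 experts with 32x32 weights down to 3 merged experts.
merged = merge_experts([np.random.randn(32, 32) for _ in range(8)], num_groups=3)
```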
arXiv Detail & Related papers (2024-10-11T07:36:14Z)
- Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number, or even just one expert, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
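The "limited number, or even just one expert" refers to top-k routing: only the k highest-scoring experts are evaluated per input, and their outputs are combined with renormalized gate weights. A minimal sketch following the common formulation, not any specific paper's implementation:

```python
import torch
import torch.nn as nn

class TopKSparseMoE(nn.Module):
    """Sparse MoE: route each input only to its top-k experts."""

    def __init__(self, dim: int, num_experts: int, k: int = 1):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, dim)
        scores = self.gate(x)                                  # (batch, E)
        top_vals, top_idx = scores.topk(self.k, dim=-1)        # keep k experts per input
        weights = torch.softmax(top_vals, dim=-1)              # renormalize over the k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e                   # inputs routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

y = TopKSparseMoE(dim=16, num_experts=8, k=2)(torch.randn(4, 16))
```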
arXiv Detail & Related papers (2024-03-26T05:48:02Z)
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($\mu$MoE) layer to address this, focusing on vision models.
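The following is a hedged sketch of the underlying idea, a CP-factorized tensor of expert weights contracted with both the input and the gate so that the full expert tensor is never materialized; it illustrates the concept rather than the exact $\mu$MoE parameterization:

```python
import torch
import torch.nn as nn

class CPFactorizedMoE(nn.Module):
    """Dense MoE whose (experts x d_in x d_out) weight tensor is CP-factorized.

    Cost grows with the CP rank rather than with num_experts * d_in * d_out.
    Conceptual sketch only, not necessarily the exact muMoE layer.
    """

    def __init__(self, d_in: int, d_out: int, num_experts: int, rank: int):
        super().__init__()
        self.gate = nn.Linear(d_in, num_experts)
        self.A = nn.Parameter(torch.randn(num_experts, rank) / rank**0.5)  # expert factor
        self.B = nn.Parameter(torch.randn(d_in, rank) / d_in**0.5)         # input factor
        self.C = nn.Parameter(torch.randn(d_out, rank))                    # output factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, d_in)
        g = torch.softmax(self.gate(x), dim=-1)              # (batch, E)
        # Contract input and gate with the factored weights; the full
        # (E, d_in, d_out) tensor of expert weights is never formed.
        xb = x @ self.B                                       # (batch, rank)
        ga = g @ self.A                                       # (batch, rank)
        return (xb * ga) @ self.C.T                           # (batch, d_out)

y = CPFactorizedMoE(d_in=16, d_out=16, num_experts=32, rank=8)(torch.randn(4, 16))
```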
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
- On Least Square Estimation in Softmax Gating Mixture of Experts [78.3687645289918]
We investigate the performance of the least squares estimators (LSE) under a deterministic MoE model.
We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions.
Our findings have important practical implications for expert selection.
arXiv Detail & Related papers (2024-02-05T12:31:18Z)
- Mixture of Tokens: Continuous MoE through Cross-Example Aggregation [0.7880651741080428]
Mixture of Experts (MoE) models are pushing the boundaries of language and vision tasks.
Mixture of Tokens (MoT) is a simple, continuous architecture capable of scaling the number of parameters similarly to sparse MoE models.
Our best models achieve a 3x increase in training speed over dense Transformer models in language pretraining.
arXiv Detail & Related papers (2023-10-24T16:03:57Z)
- Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer [59.43462055143123]
The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning.
In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity.
We propose an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts.
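A hedged sketch of that idea: before applying one expert's update, project it onto the orthogonal complement of the subspace spanned by the other experts' flattened parameter vectors, so the expert moves away from directions its peers already cover. The least-squares projection below is an illustration, not necessarily the paper's optimizer:

```python
import torch

def orthogonalize_update(update: torch.Tensor, other_experts: torch.Tensor) -> torch.Tensor:
    """Project `update` onto the orthogonal complement of span(other_experts).

    update:        (P,)   flattened update for one expert's parameters.
    other_experts: (K, P) flattened parameter vectors of the other experts.
    Illustrative projection; the paper's alternating strategy may apply this
    differently (e.g. per layer or with additional terms).
    """
    basis = other_experts.T                                            # (P, K)
    # Least-squares coefficients of `update` in the span of the other experts.
    coeffs = torch.linalg.lstsq(basis, update.unsqueeze(1)).solution   # (K, 1)
    projection = (basis @ coeffs).squeeze(1)                           # in-span component
    return update - projection                                         # orthogonal component

# Example: one expert's update made orthogonal to 3 peers in a 64-dim space.
u_orth = orthogonalize_update(torch.randn(64), peers := torch.randn(3, 64))
print(peers @ u_orth)  # entries ~0 (orthogonal up to numerical error)
```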
arXiv Detail & Related papers (2023-10-15T07:20:28Z)