FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
- URL: http://arxiv.org/abs/2511.11505v1
- Date: Fri, 14 Nov 2025 17:25:14 GMT
- Title: FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
- Authors: Yonatan Dukler, Guihong Li, Deval Shah, Vikram Appia, Emad Barsoum,
- Abstract summary: We present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. We fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication. In addition to demonstrating the retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations.
- Score: 17.64873155970997
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Blocking communication presents a major hurdle to running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the skip connections in the model, and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and when modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve accuracy within 1% of its instruction-tuned release averaged across a wide range of downstream evaluations. In addition to demonstrating the retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.
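The core idea of the abstract, overlapping a blocking collective with the next chunk of computation rather than waiting on it, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, sleep-based stand-ins for the collective and the layer compute, and the thread-pool mechanism are all illustrative assumptions, standing in for asynchronous collectives and GPU kernels in a real MoE stack.

```python
import concurrent.futures
import time

def all_to_all_communicate(tokens):
    # Stand-in for a blocking expert-dispatch collective (illustrative only).
    time.sleep(0.05)
    return tokens

def layer_compute(x):
    # Stand-in for the dense computation of a layer.
    time.sleep(0.05)
    return [v * 2 for v in x]

def blocked_forward(x):
    # Baseline: the computation cannot start until the collective finishes.
    sent = all_to_all_communicate(x)
    return layer_compute(sent)

def overlapped_forward(x):
    # FarSkip-style schedule (sketch): if the dependency through the skip
    # connection is rerouted, the compute no longer waits on the collective,
    # so the two can run concurrently and synchronize afterwards.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        comm_future = pool.submit(all_to_all_communicate, x)  # launch comm
        local = layer_compute(x)                              # overlap compute
        remote = comm_future.result()                         # sync point
    return [a + b for a, b in zip(local, remote)]

start = time.perf_counter()
blocked_forward([1.0, 2.0])
blocked_time = time.perf_counter() - start

start = time.perf_counter()
overlapped_forward([1.0, 2.0])
overlapped_time = time.perf_counter() - start
# With perfect overlap, wall time is ~max(comm, compute) instead of their sum.
print(overlapped_time < blocked_time)
```

The timing comparison shows the scheduling benefit the paper targets: the overlapped schedule hides the communication latency behind computation, whereas the blocked schedule pays for both sequentially.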
Related papers
- UnicEdit-10M: A Dataset and Benchmark Breaking the Scale-Quality Barrier via Unified Verification for Reasoning-Enriched Edits [43.59555184340113]
We introduce a lightweight data pipeline that replaces multi-toolchains with an end-to-end model and a unified post-verification stage. For scalable quality control, we train a 7B dual-task expert model, Qwen-Verify, for efficient failure detection and instruction recaptioning. This pipeline yields UnicEdit-10M, a 10M-scale dataset spanning diverse basic and complex editing tasks.
arXiv Detail & Related papers (2025-12-01T17:45:44Z) - Nonparametric Data Attribution for Diffusion Models [57.820618036556084]
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images.
arXiv Detail & Related papers (2025-10-16T03:37:16Z) - Intrinsic Training Signals for Federated Learning Aggregation [13.540945877050525]
Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. This work demonstrates that effective model merging can be achieved solely through existing training signals.
arXiv Detail & Related papers (2025-07-09T13:03:23Z) - Model Assembly Learning with Heterogeneous Layer Weight Merging [57.8462476398611]
We introduce Model Assembly Learning (MAL), a novel paradigm for model merging. MAL integrates parameters from diverse models in an open-ended model zoo to enhance the base model's capabilities.
arXiv Detail & Related papers (2025-03-27T16:21:53Z) - Towards Compatible Fine-tuning for Vision-Language Model Updates [114.25776195225494]
Class-conditioned Context Optimization (ContCoOp) integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder. Our experiments over 15 datasets show that ContCoOp achieves the highest compatibility over the baseline methods and exhibits robust out-of-distribution generalization.
arXiv Detail & Related papers (2024-12-30T12:06:27Z) - HeteroTune: Efficient Federated Learning for Large Heterogeneous Models [35.53420882449293]
We propose HeteroTune, a novel federated fine-tuning paradigm for large, heterogeneous models operating under limited communication budgets. The core of our method lies in a novel architecture, DeMA, which enables flexible and efficient aggregation of heterogeneous models. We provide both theoretical analysis and empirical evidence showing that HeteroTune achieves state-of-the-art performance and efficiency across diverse tasks and model architectures.
arXiv Detail & Related papers (2024-11-25T09:58:51Z) - EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) shows outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z) - Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference [3.217776693788795]
We propose a lightweight optimization technique called ExFlow to largely accelerate the inference of pre-trained MoE models.
By exploiting the inter-layer expert affinity, our solution can be directly applied to pre-trained MoE models without any fine-tuning or accuracy degradation.
Our solution beats the cutting-edge MoE implementations with experts from 8 to 64, with up to 2.2x improvement in inference throughput.
arXiv Detail & Related papers (2024-01-16T14:16:47Z) - Precision-Recall Divergence Optimization for Generative Modeling with GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows.
We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the PR-divergences.
Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
arXiv Detail & Related papers (2023-05-30T10:07:17Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [47.432215933099016]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. This creates a barrier to fusing knowledge across individual models to yield a better single model. We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - Bayesian Prompt Learning for Image-Language Model Generalization [64.50204877434878]
We use the regularization ability of Bayesian methods to frame prompt learning as a variational inference problem.
Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts.
We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space.
arXiv Detail & Related papers (2022-10-05T17:05:56Z) - Slimmable Domain Adaptation [112.19652651687402]
We introduce a simple framework, Slimmable Domain Adaptation, to improve cross-domain generalization with a weight-sharing model bank.
Our framework surpasses other competing approaches by a very large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-06-14T06:28:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.