Related papers: MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models

URL: http://arxiv.org/abs/2512.12121v1
Date: Sat, 13 Dec 2025 01:22:10 GMT
Title: MixtureKit: A General Framework for Composing, Training, and Visualizing Mixture-of-Experts Models
Authors: Ahmad Chamma, Omar El Herraoui, Guokan Shang,
Abstract summary: We introduce a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit supports three complementary methods: (i) Traditional MoE, which uses a single router per transformer block to select experts; (ii) BTX (Branch-Train-Mix), which introduces separate routers for each specified sub-layer, enabling fine-grained token routing; and (iii) BTS (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between hub and experts.
Score: 6.3350351413826544
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce MixtureKit, a modular open-source framework for constructing, training, and analyzing Mixture-of-Experts (MoE) models from arbitrary pre-trained or fine-tuned models. MixtureKit currently supports three complementary methods: (i) Traditional MoE, which uses a single router per transformer block to select experts, (ii) BTX (Branch-Train-Mix), which introduces separate routers for each specified sub-layer, enabling fine-grained token routing, and (iii) BTS (Branch-Train-Stitch), which keeps experts fully intact and introduces trainable stitch layers for controlled information exchange between hub and experts. MixtureKit automatically modifies the model configuration, patches decoder and causal LM classes, and saves a unified checkpoint ready for inference or fine-tuning. We further provide a visualization interface to inspect per-token routing decisions, expert weight distributions, and layer-wise contributions. Experiments with multilingual code-switched data (e.g., Arabic-Latin) show that a BTX-based model trained using MixtureKit can outperform baseline dense models on multiple benchmarks. We release MixtureKit as a practical foundation for research and development of MoE-based systems across diverse domains.
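
The abstract's "single router per transformer block" can be illustrated with a generic top-k softmax router. The sketch below is a minimal, framework-free illustration and is not MixtureKit's API: the names `route_token`, `router_weights`, and `top_k`, and the choice of a linear router with softmax gates renormalized over the selected experts, are all assumptions made for this example.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(
    token: Vector,
    router_weights: Sequence[Vector],
    experts: Sequence[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Hypothetical single-router MoE step for one token.

    router_weights holds one row per expert; each row is dotted with the
    token to produce that expert's routing logit (an assumed linear router).
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Keep the indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize gate weights over the selected experts only.
    gates = softmax([logits[i] for i in chosen])
    output = [0.0] * len(token)
    for gate, index in zip(gates, chosen):
        expert_output = experts[index](token)
        output = [o + gate * e for o, e in zip(output, expert_output)]
    return output


# Toy experts standing in for expert sub-networks (illustrative only).
toy_experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [-x for x in v],
    lambda v: [x + 1.0 for x in v],
]
toy_router = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

mixed = route_token([3.0, 1.0], toy_router, toy_experts, top_k=2)
```

In this picture, the per-sub-layer routing the abstract attributes to BTX would correspond to keeping one such router for each specified sub-layer rather than one per block; how MixtureKit implements either variant is not specified here.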

Related papers

MIP Candy: A Modular PyTorch Framework for Medical Image Processing [4.024630879799288]
MIPCandy is a PyTorch-based framework designed specifically for medical image processing. Central to the design is LayerT, a deferred configuration mechanism that enables runtime substitution of convolution, normalization, and activation modules. MIPCandy is open-source under the Apache-2.0 license and requires Python 3.12 or later.
arXiv Detail & Related papers (2026-02-24T15:55:04Z)
TokaMind: A Multi-Modal Transformer Foundation Model for Tokamak Plasma Dynamics [56.073642366268764]
TokaMind is an open-source foundation model framework for fusion plasma modeling. It is trained on heterogeneous tokamak diagnostics from the publicly available MAST dataset. We evaluate TokaMind on the recently introduced MAST benchmark TokaMark.
arXiv Detail & Related papers (2026-02-16T12:26:07Z)
MaD-Mix: Multi-Modal Data Mixtures via Latent Space Coupling for Vision-Language Model Training [54.78779514101305]
MaD-Mix is a principled framework that derives multi-modal data mixtures for VLM training. MaD-Mix speeds VLM training across diverse benchmarks. In complex tri-modal video-image-text scenarios, MaD-Mix boosts average accuracy over uniform weights, with negligible mixture overhead.
arXiv Detail & Related papers (2026-02-08T03:07:36Z)
MoSE: Mixture of Slimmable Experts for Efficient and Adaptive Language Models [28.87682703032017]
Mixture-of-Experts (MoE) models scale large language models efficiently by sparsely activating experts, but once an expert is selected, it is executed fully. We propose Mixture of Slimmable Experts (MoSE), an MoE architecture in which each expert has a nested, slimmable structure that can be executed at variable widths.
arXiv Detail & Related papers (2026-02-05T19:48:41Z)
Mixtera: A Data Plane for Foundation Model Training [1.797352319167759]
We build and present Mixtera, a data plane for foundation model training. We show that Mixtera does not bottleneck training and scales to 256 GH200 superchips. We also explore the role of mixtures for vision-language models.
arXiv Detail & Related papers (2025-02-27T05:55:44Z)
MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts [38.15244333975921]
MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
arXiv Detail & Related papers (2024-07-13T09:22:33Z)
Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [81.18305296110853]
We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion. BTX generalizes two special cases, the Branch-Train-Merge method, which does not have the MoE finetuning stage to learn routing, and sparse upcycling, which omits the stage of training experts asynchronously.
arXiv Detail & Related papers (2024-03-12T16:54:58Z)
Mixture-Models: a one-stop Python Library for Model-based Clustering using various Mixture Models [4.60168321737677]
Mixture-Models is an open-source Python library for fitting Gaussian Mixture Models (GMM) and their variants. It streamlines the implementation and analysis of these models using various first/second order optimization routines. The library provides user-friendly model evaluation tools, such as BIC, AIC, and log-likelihood estimation.
arXiv Detail & Related papers (2024-02-08T19:34:24Z)
Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training. We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE). MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
arXiv Detail & Related papers (2024-02-08T03:46:32Z)
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models [51.35570730554632]
ESPnet-SPK is a toolkit for training speaker embedding extractors. We provide several models, ranging from x-vector to recent SKA-TDNN. We also aspire to bridge developed models with other domains.
arXiv Detail & Related papers (2024-01-30T18:18:27Z)
MatFormer: Nested Transformer for Elastic Inference [91.45687988953435]
MatFormer is a novel Transformer architecture designed to provide elastic inference across diverse deployment constraints. MatFormer achieves this by incorporating a nested Feed Forward Network (FFN) block structure within a standard Transformer model. We show that an 850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters.
arXiv Detail & Related papers (2023-10-11T17:57:14Z)
OpenMixup: Open Mixup Toolbox and Benchmark for Visual Representation Learning [53.57075147367114]
We introduce OpenMixup, the first mixup augmentation and benchmark for visual representation learning. We train 18 representative mixup baselines from scratch and rigorously evaluate them across 11 image datasets. We also open-source our modular backbones, including a collection of popular vision backbones, optimization strategies, and analysis toolkits.
arXiv Detail & Related papers (2022-09-11T12:46:01Z)
StableMoE: Stable Routing Strategy for Mixture of Experts [109.0602120199226]
The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead. We propose StableMoE with two training stages to address the routing fluctuation problem. Results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance.
arXiv Detail & Related papers (2022-04-18T16:48:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.