UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
- URL: http://arxiv.org/abs/2510.13344v1
- Date: Wed, 15 Oct 2025 09:30:25 GMT
- Title: UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
- Authors: Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, Shenyuan Jiang, Xinyu Chen, Haoyuan Shi, Jinchao Li, Qi Wang, Haolan Chen, Fanbo Meng, Mingjun Zhao, Yu Xu, Yancheng He, Baotian Hu, Min Zhang,
- Abstract summary: UniMoE-Audio is a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. To tackle data imbalance, we introduce a three-stage training curriculum. UniMoE-Audio achieves state-of-the-art performance on major speech and music generation benchmarks.
- Score: 48.211103577288675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages the original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert on a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
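Since the abstract describes the routing only at a high level, the following is a minimal sketch of how Top-P routing over routed, shared, and null experts could look. Everything concrete here (module sizes, the nucleus-style cutoff, treating null experts as weighted no-ops) is an assumption made for illustration, not the authors' implementation.

```python
# Minimal sketch of Top-P expert routing with routed, shared, and null
# experts, loosely following the UniMoE-Audio abstract. All module names,
# sizes, and the exact gating math are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopPMoELayer(nn.Module):
    def __init__(self, d_model=256, n_routed=4, n_null=2, top_p=0.7):
        super().__init__()
        self.top_p, self.n_routed = top_p, n_routed
        # Routed experts hold domain-specific knowledge (e.g. speech vs. music).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_routed))
        # One always-on shared expert for domain-agnostic features.
        self.shared = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))
        # Gate scores routed experts plus "null" slots; a null expert is a
        # no-op, so probability mass routed to it skips computation.
        self.gate = nn.Linear(d_model, n_routed + n_null)

    def forward(self, x):                      # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)
        sorted_p, idx = probs.sort(dim=-1, descending=True)
        cum = sorted_p.cumsum(dim=-1)
        # Keep the smallest expert set whose cumulative probability
        # reaches top_p (the top-1 expert is always kept).
        keep = cum - sorted_p < self.top_p
        mask = torch.zeros_like(probs).scatter(-1, idx, keep.float())
        weights = probs * mask
        weights = weights / weights.sum(-1, keepdim=True)
        out = self.shared(x)                   # shared expert always runs
        for e in range(self.n_routed):         # null slots (e >= n_routed) add nothing
            w = weights[:, e:e + 1]
            if w.any():
                out = out + w * self.experts[e](x)
        return out

tokens = torch.randn(8, 256)
print(TopPMoELayer()(tokens).shape)            # torch.Size([8, 256])
```

In this reading, probability mass assigned to null slots is simply dropped from the routed sum, which is one plausible interpretation of "adaptive computation skipping".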
Related papers
- MOVA: Towards Scalable and Synchronized Video-Audio Generation [91.56945636522345]
We introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators.
arXiv Detail & Related papers (2026-02-09T15:31:54Z)
- ERNIE 5.0 Technical Report [244.36480708815316]
ERNIE 5.0 is a unified autoregressive foundation model for multimodal understanding and generation across text, image, video, and audio. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. We show that ERNIE 5.0 achieves strong and balanced performance across multiple modalities.
arXiv Detail & Related papers (2026-02-04T16:18:15Z)
- MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts [12.42628977620548]
MoST (Mixture of Speech and Text) is a novel large language model that seamlessly integrates speech and text processing. We introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MoST consistently outperforms existing models of comparable parameter counts.
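A hedged sketch of the modality-aware dispatch described above: tokens are routed to separate expert pools according to an input-type tag. The pool sizes, the two-modality split, and the soft mix inside each pool are assumptions, not MoST's actual design.

```python
# Sketch of modality-aware routing: hard dispatch by modality tag,
# then a soft expert mix within the selected pool. All sizes are assumed.
import torch
import torch.nn as nn

class ModalityRouter(nn.Module):
    def __init__(self, d_model=128, experts_per_modality=2):
        super().__init__()
        # Separate expert pools per modality; index 0 = speech, 1 = text.
        self.pools = nn.ModuleList(
            nn.ModuleList(nn.Linear(d_model, d_model)
                          for _ in range(experts_per_modality))
            for _ in range(2))
        self.gates = nn.ModuleList(nn.Linear(d_model, experts_per_modality)
                                   for _ in range(2))

    def forward(self, x, modality):            # modality: (tokens,) of {0, 1}
        out = torch.zeros_like(x)
        for m, (pool, gate) in enumerate(zip(self.pools, self.gates)):
            sel = modality == m
            if sel.any():
                w = gate(x[sel]).softmax(-1)   # soft mix within the pool
                out[sel] = sum(w[:, i:i + 1] * pool[i](x[sel])
                               for i in range(len(pool)))
        return out

x = torch.randn(6, 128)
modality = torch.tensor([0, 0, 1, 1, 0, 1])
print(ModalityRouter()(x, modality).shape)     # torch.Size([6, 128])
```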
arXiv Detail & Related papers (2026-01-15T10:43:29Z)
- MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free [27.346096262060787]
We introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks.
arXiv Detail & Related papers (2026-01-06T12:24:38Z)
- SAM Audio: Segment Anything in Audio [55.50609519820557]
General audio source separation is a key capability for multimodal AI systems. We present SAM Audio, a foundation model for general audio separation. It unifies text, visual, and temporal span prompting within a single framework.
arXiv Detail & Related papers (2025-12-19T22:14:23Z)
- Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. We use an internal metric to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
arXiv Detail & Related papers (2025-09-28T15:13:38Z)
- Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts [18.18231276284727]
Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely. Recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. This paper addresses the limited expert diversity of single-model upcycling by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models.
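As a schematic of the upcycling idea mentioned above, the snippet below seeds an expert bank from the FFN weights of pre-trained dense checkpoints. The helper name and sizes are hypothetical, and the "disparate pre-trained models" are only mimicked with stand-ins.

```python
# Illustrative "upcycling" sketch: MoE experts are initialized from the
# FFN weights of dense models, while the router is trained from scratch.
import copy
import torch.nn as nn

def upcycle_ffns_to_experts(ffn_layers, d_model=128):
    """Turn a list of pre-trained FFN modules into a routed expert bank."""
    experts = nn.ModuleList(copy.deepcopy(f) for f in ffn_layers)
    gate = nn.Linear(d_model, len(experts))   # router trained from scratch
    return experts, gate

# Two identically-architected stand-ins for disparate dense checkpoints.
ffn_a = nn.Sequential(nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 128))
ffn_b = nn.Sequential(nn.Linear(128, 512), nn.GELU(), nn.Linear(512, 128))
experts, gate = upcycle_ffns_to_experts([ffn_a, ffn_b])
print(len(experts), gate.out_features)        # 2 2
```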
arXiv Detail & Related papers (2025-09-23T02:07:14Z)
- MambAttention: Mamba with Multi-Head Attention for Generalizable Single-Channel Speech Enhancement [26.526517674876086]
We propose a novel hybrid architecture, MambAttention, which combines Mamba and shared time- and frequency-multi-head attention modules. Our proposed MambAttention model significantly outperforms existing state-of-the-art LSTM-, xLSTM-, Mamba-, and Conformer-based systems.
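A dependency-free sketch of the shared time- and frequency-attention idea follows: one attention module is reused along both axes of a spectrogram. A GRU stands in for the Mamba block purely to keep the sketch self-contained, and all shapes are illustrative rather than the paper's configuration.

```python
# Schematic of shared time/frequency multi-head attention over a
# spectrogram; the GRU is a stand-in for the Mamba block.
import torch
import torch.nn as nn

class SharedTFAttention(nn.Module):
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.seq = nn.GRU(d, d, batch_first=True)        # stand-in for Mamba
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, spec):                  # spec: (batch, time, freq, d)
        b, t, f, d = spec.shape
        x, _ = self.seq(spec.reshape(b, t * f, d))
        x = x.reshape(b, t, f, d)
        # The same attention module attends along time ...
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, d)
        xt, _ = self.attn(xt, xt, xt)
        x = xt.reshape(b, f, t, d).permute(0, 2, 1, 3)
        # ... and along frequency, sharing its parameters across both axes.
        xf = x.reshape(b * t, f, d)
        xf, _ = self.attn(xf, xf, xf)
        return xf.reshape(b, t, f, d)

print(SharedTFAttention()(torch.randn(2, 10, 16, 32)).shape)
```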
arXiv Detail & Related papers (2025-07-01T17:16:05Z)
- Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks [4.132793413136553]
We introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism.
The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention.
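One common way to realize variable-length attention, shown below as an assumption rather than Echo-MSA's actual mechanism, is a key-padding mask that lets a single attention module serve utterances of different lengths in one batch.

```python
# Variable-length attention via a key-padding mask: padded positions
# are excluded from attention, so one module handles any utterance length.
import torch
import torch.nn as nn

d, heads = 32, 4
attn = nn.MultiheadAttention(d, heads, batch_first=True)
lengths = torch.tensor([7, 12, 5])            # true utterance lengths
max_len = int(lengths.max())
x = torch.randn(3, max_len, d)                # zero-padded batch
# True where a position is padding and must be ignored by attention.
pad_mask = torch.arange(max_len)[None, :] >= lengths[:, None]
out, _ = attn(x, x, x, key_padding_mask=pad_mask)
print(out.shape)                              # torch.Size([3, 12, 32])
```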
arXiv Detail & Related papers (2023-09-14T14:51:51Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
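A rough sketch of MLM-style pre-training with teacher pseudo labels, as the summary describes. The frozen teacher, codebook size, and mask rate here are stand-in assumptions, not MERT's actual teachers or losses.

```python
# Masked prediction over acoustic frames: a frozen teacher provides
# discrete pseudo labels; the student predicts them at masked positions.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, codebook, mask_rate = 64, 512, 0.3
student = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, codebook))
teacher = nn.Linear(d, codebook).requires_grad_(False)  # frozen pseudo-labeler

frames = torch.randn(100, d)                  # acoustic frame features
mask = torch.rand(100) < mask_rate
with torch.no_grad():
    pseudo = teacher(frames).argmax(-1)       # discrete pseudo labels
inp = frames.clone()
inp[mask] = 0.0                               # mask out selected frames
logits = student(inp)
loss = F.cross_entropy(logits[mask], pseudo[mask])  # predict labels at masks
print(float(loss))
```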
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.