Fugu-MT Paper Translation (Abstract): UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

Paper overview: UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

arXiv URL: http://arxiv.org/abs/2510.13344v1
Date: Wed, 15 Oct 2025 09:30:25 GMT
Status: translation complete
Last updated in system: 2025-10-16 20:13:28.59749
Title: UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE
Authors: Zhenyu Liu, Yunxin Li, Xuanyu Zhang, Qixun Teng, Shenyuan Jiang, Xinyu Chen, Haoyuan Shi, Jinchao Li, Qi Wang, Haolan Chen, Fanbo Meng, Mingjun Zhao, Yu Xu, Yancheng He, Baotian Hu, Min Zhang
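Abstract summary: UniMoE-Audio is a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. To address data imbalance, it introduces a three-stage training curriculum. UniMoE-Audio achieves state-of-the-art performance on major speech and music generation benchmarks.
Reference score (attention score computed independently by this site): 48.211103577288675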
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of the balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architectures and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
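The abstract names Top-P routing and null experts but gives no formulas, so the following is only a minimal sketch of how such a router could behave, not the authors' implementation. It assumes a nucleus-style rule (select the fewest experts whose gate probabilities sum to at least p) and assumes null experts occupy the gate slots after the routed experts. The function names, the threshold value, and the index layout are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_route(logits: Sequence[float], p: float) -> List[int]:
    """Return the smallest set of expert indices whose gate probabilities
    sum to at least p, taken in descending order of probability.
    (Assumed rule; the abstract does not define Top-P routing precisely.)"""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen: List[int] = []
    mass = 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return chosen


def active_routed_experts(chosen: Sequence[int], num_routed: int) -> List[int]:
    """Drop null-expert slots (assumed to be indices >= num_routed).
    A null expert performs no computation, so only the remaining
    indices would actually be executed."""
    return [i for i in chosen if i < num_routed]


# Hypothetical layout: 4 routed experts (indices 0-3) plus 1 null expert (index 4).
confident = top_p_route([4.0, 0.0, 0.0, 0.0, 0.0], p=0.7)   # [0]
uncertain = top_p_route([1.0, 1.0, 1.0, 0.0, 0.0], p=0.7)   # [0, 1, 2]
skipped = active_routed_experts(
    top_p_route([0.0, 0.0, 0.0, 0.0, 5.0], p=0.7), num_routed=4
)                                                            # []
```

Under these assumptions the number of active experts varies per token: a confidently routed token uses one expert, an ambiguous one uses three, and a token routed to the null slot runs no routed expert at all. Shared experts, which the abstract describes as domain-agnostic, would presumably run for every token regardless of the gate and are omitted here.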
