Fugu-MT Paper Translation (Abstract): MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Paper overview: MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

arxiv url: http://arxiv.org/abs/2508.06098v1
Date: Fri, 08 Aug 2025 07:49:59 GMT
Status: Translation complete
Last updated in system: 2025-08-11 20:39:06.129199
Title: MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Authors: Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, Xie Chen
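Abstract summary: MeanAudio is a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. It regresses the average velocity field during training, enabling fast generation by mapping directly from the start to the endpoint of the flow trajectory. Experiments show that MeanAudio achieves state-of-the-art performance in single-step audio generation.
Reference score (attention score computed independently by this site): 2.808913221639433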
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent developments in diffusion- and flow-based models have significantly advanced Text-to-Audio Generation (TTA). While achieving great synthesis quality and controllability, current TTA systems still suffer from slow inference speed, which significantly limits their practical applicability. This paper presents MeanAudio, a novel MeanFlow-based model tailored for fast and faithful text-to-audio generation. Built on a Flux-style latent transformer, MeanAudio regresses the average velocity field during training, enabling fast generation by mapping directly from the start to the endpoint of the flow trajectory. By incorporating classifier-free guidance (CFG) into the training target, MeanAudio incurs no additional cost in the guided sampling process. To further stabilize training, we propose an instantaneous-to-mean curriculum with flow field mix-up, which encourages the model to first learn the foundational instantaneous dynamics, and then gradually adapt to mean flows. This strategy proves critical for enhancing training efficiency and generation quality. Experimental results demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also demonstrates strong performance in multi-step generation, enabling smooth and coherent transitions across successive synthesis steps.
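The abstract's central idea, regressing an average velocity so that one step covers the whole flow trajectory, can be illustrated with a toy example. The sketch below is not MeanAudio's implementation: it assumes a hypothetical one-dimensional flow dz/dτ = A·z, a time convention in which sampling runs from τ = 1 to τ = 0, and a closed-form average velocity standing in for what a trained network would predict. All function names and the constant `A` are illustrative. The `real_time_factor` helper assumes the common definition of RTF as generation time divided by audio duration, which the abstract does not spell out.

```python
import math

A = 0.8  # hypothetical rate constant of the toy flow dz/dtau = A * z


def instantaneous_velocity(z, tau):
    """Instantaneous velocity v(z, tau) of the toy flow (independent of tau here)."""
    return A * z


def average_velocity(z_t, r, t):
    """Average of v along the trajectory through z_t over [r, t].

    Equals displacement divided by elapsed time: (z_t - z_r) / (t - r).
    In a MeanFlow-style model a network would be trained to predict this;
    the closed form is available only because the toy flow is solvable.
    """
    if t == r:
        return instantaneous_velocity(z_t, t)
    z_r = z_t * math.exp(A * (r - t))  # exact solution of the toy ODE
    return (z_t - z_r) / (t - r)


def euler_sample(z_start, steps, t_start=1.0, t_end=0.0):
    """Multi-step sampling: integrate the instantaneous velocity with Euler steps."""
    z, tau = z_start, t_start
    dt = (t_start - t_end) / steps
    for _ in range(steps):
        z = z - dt * instantaneous_velocity(z, tau)
        tau -= dt
    return z


def one_step_sample(z_start, t_start=1.0, t_end=0.0):
    """Single-step sampling: jump from start to endpoint using the average velocity."""
    return z_start - (t_start - t_end) * average_velocity(z_start, t_end, t_start)


def real_time_factor(generation_seconds, audio_seconds):
    """RTF under the assumed definition: generation time / duration of generated audio."""
    return generation_seconds / audio_seconds


if __name__ == "__main__":
    z1 = 2.0
    print("exact endpoint      :", z1 * math.exp(-A))
    print("one step, mean flow :", one_step_sample(z1))
    print("one Euler step      :", euler_sample(z1, 1))
    print("100 Euler steps     :", euler_sample(z1, 100))
    print("RTF for 10 s audio generated in 0.13 s:", real_time_factor(0.13, 10.0))
```

In this toy setting a single step with the average velocity reaches the exact endpoint, whereas a single Euler step with the instantaneous velocity does not, and many Euler steps are needed to get close. Under the assumed RTF definition, the reported 0.013 would correspond to roughly 0.13 seconds of compute for 10 seconds of audio.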
