Fugu-MT Paper Translation (Abstract): MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

Paper overview: MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows

arxiv url: http://arxiv.org/abs/2508.06098v2
Date: Wed, 22 Oct 2025 09:22:42 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:09.035017
Title: MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Authors: Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, Xie Chen
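Abstract summary: MeanAudio is a fast and faithful text-to-audio generator that can render realistic sound with only one function evaluation (1-NFE). We demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation.
Reference score (independently computed attention score): 13.130255838403002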
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent years have witnessed remarkable progress in Text-to-Audio Generation (TTA), providing sound creators with powerful tools to transform inspirations into vivid audio. Yet despite these advances, current TTA systems often suffer from slow inference speed, which greatly hinders the efficiency and smoothness of audio creation. In this paper, we present MeanAudio, a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). MeanAudio leverages: (i) the MeanFlow objective with guided velocity target that significantly accelerates inference speed, (ii) an enhanced Flux-style transformer with dual text encoders for better semantic alignment and synthesis quality, and (iii) an efficient instantaneous-to-mean curriculum that speeds up convergence and enables training on consumer-grade GPUs. Through a comprehensive evaluation study, we demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also shows strong performance in multi-step generation, enabling smooth transitions across successive synthesis steps.
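The two quantitative ideas in the abstract, one-function-evaluation (1-NFE) sampling and the real-time factor, can be illustrated with a minimal sketch. It assumes the usual mean-flow convention in which t = 1 is noise, t = 0 is data, and a step from t to r applies z_r = z_t - (t - r) * u(z_t, r, t), where u is the average velocity over [r, t]. The names `sample_mean_flow`, `toy_mean_velocity`, `TOY_TARGET`, and `real_time_factor` are hypothetical, the toy velocity is an analytic stand-in for a trained network, and the 10-second clip length is an assumed example value. None of this is the authors' code.

```python
import random

# Hypothetical "data": every sample of this toy distribution is this one point.
TOY_TARGET = [0.5, -1.0, 2.0]


def toy_mean_velocity(z, r, t):
    """Analytic average velocity for a straight path from noise to TOY_TARGET.

    With z_t = (1 - t) * x + t * noise, the velocity along the path is constant
    and equals (z_t - x) / t, so the average over [r, t] is the same value.
    A real system would replace this function with a trained network.
    """
    return [(zi - xi) / t for zi, xi in zip(z, TOY_TARGET)]


def sample_mean_flow(mean_velocity, noise, steps=1):
    """Move from t = 1 (noise) to t = 0 (data) in `steps` function evaluations.

    Each step applies z_r = z_t - (t - r) * u(z_t, r, t). With steps=1 this is
    the 1-NFE case: x = noise - u(noise, 0, 1).
    """
    z = list(noise)
    for i in range(steps):
        t = 1.0 - i / steps
        r = 1.0 - (i + 1) / steps
        u = mean_velocity(z, r, t)
        z = [zi - (t - r) * ui for zi, ui in zip(z, u)]
    return z


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    noise = [random.gauss(0.0, 1.0) for _ in TOY_TARGET]
    print("1-NFE sample:", sample_mean_flow(toy_mean_velocity, noise, steps=1))
    print("4-NFE sample:", sample_mean_flow(toy_mean_velocity, noise, steps=4))
    # Assumed example: 0.13 s of compute for a 10 s clip.
    print("RTF:", real_time_factor(0.13, 10.0))
```

Under this definition, an RTF of 0.013 means that a 10-second clip takes about 0.13 seconds to generate. Because the same update rule covers any number of steps, the sketch also shows how a single model can serve both single-step and multi-step generation.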

Related papers