Fugu-MT Paper Translation (Abstract): MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation

Paper abstract: MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation

arxiv url: http://arxiv.org/abs/2509.06389v1
Date: Mon, 08 Sep 2025 07:15:21 GMT
Status: Translation complete
System update date: 2025-09-09 14:07:03.999604
Title: MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation
Title (reference translation): Average-Flow-Velocity Multimodal Video-to-Audio Synthesis via One-Step Generation
Authors: Xiaoran Yang, Jianxuan Yang, Xinyue Guo, Haoyu Wang, Ningning Pan, Gongping Huang
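Abstract summary: A key challenge in synthesizing audio from silent videos is the trade-off between synthesis quality and inference efficiency. We propose a MeanFlow-accelerated model that characterizes flow fields using average velocity. We show that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality.
Reference score (independently computed attention score): 12.665130073406651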
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A key challenge in synthesizing audio from silent videos is the inherent trade-off between synthesis quality and inference efficiency in existing methods. For instance, flow-matching-based models rely on modeling instantaneous velocity and therefore inherently require an iterative sampling process, leading to slow inference. To address this efficiency bottleneck, we introduce a MeanFlow-accelerated model that characterizes flow fields using average velocity, enabling one-step generation and thereby significantly accelerating multimodal video-to-audio (VTA) synthesis while preserving audio quality, semantic alignment, and temporal synchronization. Furthermore, a scalar rescaling mechanism is employed to balance conditional and unconditional predictions when classifier-free guidance (CFG) is applied, effectively mitigating CFG-induced distortions in one-step generation. Since the audio synthesis network is jointly trained with multimodal conditions, we further evaluate it on the text-to-audio (TTA) synthesis task. Experimental results demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality on both VTA and TTA synthesis tasks.
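Abstract (reference translation): A key challenge in synthesizing audio from silent video is the trade-off between synthesis quality and inference efficiency in existing methods. For example, flow-matching-based models rely on modeling instantaneous velocity and inherently require an iterative sampling process, which slows inference. To address this efficiency bottleneck, we introduce a MeanFlow-accelerated model that characterizes the flow field using average velocity, enabling one-step generation and significantly accelerating multimodal video-to-audio (VTA) synthesis while preserving audio quality, semantic alignment, and temporal synchronization. In addition, when classifier-free guidance (CFG) is applied, a scalar rescaling mechanism is used to balance the conditional and unconditional predictions, effectively mitigating CFG-induced distortion in one-step generation. Because the audio synthesis network is jointly trained with multimodal conditions, we further evaluate it on the text-to-audio (TTA) synthesis task. Experiments show that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality on the VTA and TTA synthesis tasks.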
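
The contrast the abstract draws between instantaneous and average velocity can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical one-dimensional velocity field dz/dt = z whose average velocity over an interval is known in closed form, whereas in the paper the average velocity would be predicted by a trained network. All function names are illustrative. It shows that a single step along the average velocity recovers the endpoint exactly, while Euler integration of the instantaneous velocity only approaches it as the step count grows.

```python
import math


def instantaneous_velocity(z, t):
    # Toy field (assumption, not from the paper): dz/dt = z,
    # so trajectories are z_t = z_0 * exp(t).
    return z


def average_velocity(z_t, r, t):
    # Closed-form average of the toy velocity over [r, t] along the
    # trajectory passing through z_t at time t:
    #   displacement = z_t - z_r = z_t * (1 - exp(r - t))
    # A learned network would stand in for this function in practice.
    return z_t * (1.0 - math.exp(r - t)) / (t - r)


def euler_sample(z_t, r, t, num_steps):
    # Iterative sampling: integrate the instantaneous velocity
    # backward from time t to time r with fixed-size Euler steps.
    h = (t - r) / num_steps
    z, tau = z_t, t
    for _ in range(num_steps):
        z = z - h * instantaneous_velocity(z, tau)
        tau -= h
    return z


def one_step_sample(z_t, r, t):
    # One-step sampling: a single jump along the average velocity.
    return z_t - (t - r) * average_velocity(z_t, r, t)


z1 = 2.0                        # state at t = 1 (the "noise" end)
exact = z1 * math.exp(-1.0)     # true state at r = 0 for the toy field
one_step = one_step_sample(z1, 0.0, 1.0)
euler_4 = euler_sample(z1, 0.0, 1.0, 4)
euler_100 = euler_sample(z1, 0.0, 1.0, 100)

print("exact     :", exact)
print("one step  :", one_step)
print("euler 4   :", euler_4)
print("euler 100 :", euler_100)
```

In this toy setting the one-step result is exact only because the average velocity is available analytically; with a learned approximation, the quality of the single step depends on how well the network estimates that average.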

Related papers