Fugu-MT Paper Translation (Abstract): Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

Paper abstract: Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows

arxiv url: http://arxiv.org/abs/2603.08126v1
Date: Mon, 09 Mar 2026 09:06:25 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:15.721328
Title: Foley-Flow: Coordinated Video-to-Audio Generation with Masked Audio-Visual Alignment and Dynamic Conditional Flows
Title (reference translation): Foley-Flow: Coordinated Video-to-Audio Generation Using Masked Audio-Visual Alignment and Dynamic Conditional Flows
Authors: Shentong Mo, Yibing Song
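Abstract summary: Coordinated audio generation based on video inputs typically requires strict audio-visual (AV) alignment. The authors propose FoleyFlow, which first aligns unimodal AV encoders via masked modeling training. After training, the AV encoders, which were separately pretrained using only unimodal data, are aligned with semantic and rhythmic consistency.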
Reference score (independently computed attention score): 75.44753202066171
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Coordinated audio generation based on video inputs typically requires a strict audio-visual (AV) alignment, where both semantics and rhythmics of the generated audio segments shall correspond to those in the video frames. Previous studies leverage a two-stage design where the AV encoders are firstly aligned via contrastive learning, then the encoded video representations guide the audio generation process. We observe that both contrastive learning and global video guidance are effective in aligning overall AV semantics while limiting temporally rhythmic synchronization. In this work, we propose FoleyFlow to first align unimodal AV encoders via masked modeling training, where the masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders which are separately pretrained using only unimodal data are aligned with semantic and rhythmic consistency. Then, we develop a dynamic conditional flow for the final audio generation. Built upon the efficient velocity flow generation framework, our dynamic conditional flow utilizes temporally varying video features as the dynamic condition to guide corresponding audio segment generations. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment, and use this representation of video segments to guide audio generation temporally. Our audio results are evaluated on the standard benchmarks and largely surpass existing results under several metrics. The superior performance indicates that FoleyFlow is effective in generating coordinated audios that are both semantically and rhythmically coherent to various video sequences.
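Abstract (reference translation): Coordinated audio generation based on video inputs typically requires strict audio-visual alignment, in which both the semantics and the rhythm of the generated audio segments correspond to those in the video frames. Previous studies used a two-stage design in which the AV encoders are first aligned via contrastive learning and the encoded video representations then guide the audio generation process. We observe that both contrastive learning and global video guidance are effective at aligning overall AV semantics but limit temporal rhythmic synchronization. In this work, we propose FoleyFlow, which first aligns unimodal AV encoders via masked modeling training in which masked audio segments are recovered under the guidance of the corresponding video segments. After training, the AV encoders, which were separately pretrained using only unimodal data, are aligned with semantic and rhythmic consistency. We then develop a dynamic conditional flow for the final audio generation. Built on the efficient velocity flow generation framework, it uses temporally varying video features as the dynamic condition to guide the generation of the corresponding audio segments. To this end, we extract coherent semantic and rhythmic representations during masked AV alignment and use these representations of the video segments to guide audio generation over time. Our audio results are evaluated on standard benchmarks and largely surpass existing results on several metrics. The superior performance indicates that FoleyFlow is effective at generating coordinated audio that is semantically and rhythmically coherent with a variety of video sequences.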
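
Illustrative sketch (not from the paper): the abstract contrasts global video guidance with a dynamic condition that varies per segment. The toy NumPy example below is only meant to make that contrast concrete and is not the authors' implementation. The names `toy_velocity`, `sample`, and the projection `W` are hypothetical, and a closed-form velocity for a straight-line path stands in for a trained network.

```python
import numpy as np

def toy_velocity(x, t, cond, W):
    """Hypothetical stand-in for a learned velocity network.

    x:    (T, Da) audio latents, one row per temporal segment
    cond: (T, Dv) conditioning features, one row per segment
    W:    (Dv, Da) toy projection from video features to audio latents
    For a straight path from noise to `target`, the velocity that
    reaches the target at t = 1 is (target - x) / (1 - t).
    """
    target = cond @ W
    return (target - x) / (1.0 - t)

def sample(cond, W, steps=8, seed=0):
    """Euler integration of the toy flow from t = 0 towards t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((cond.shape[0], W.shape[1]))  # start from noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, cond, W)
    return x

rng = np.random.default_rng(1)
T, Dv, Da = 4, 3, 2                      # segments, video dim, audio dim
video = rng.standard_normal((T, Dv))     # made-up per-segment video features
W = rng.standard_normal((Dv, Da))

# Dynamic condition: each audio segment is guided by its own video segment.
dynamic = sample(video, W)

# Global condition: every segment is guided by the same pooled feature.
global_cond = np.repeat(video.mean(axis=0, keepdims=True), T, axis=0)
static = sample(global_cond, W)
```

In this toy setting the dynamically conditioned output differs from segment to segment, while the globally conditioned output collapses to the same latent for every segment, which is one way to picture why global guidance alone could limit temporal synchronization.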

Related papers