Fugu-MT Paper Translation (Abstract): CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

Paper overview: CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation

arxiv url: http://arxiv.org/abs/2605.18916v2
Date: Mon, 25 May 2026 12:15:23 GMT
Status: Translation complete
Last updated in system: 2026-05-26 22:28:52.048273
Title: CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
Title (reference translation): CounterFlow: Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
Authors: Gyubin Lee, Junwon Lee, Juhan Nam
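Abstract summary: An inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source. Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt.
Reference score (independently computed attention score): 17.978516888210542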
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchored to the visually implied sound source when video and text contents disagree. We present CounterFlow, an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source; Phase 2 drops video conditioning to focus entirely on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual Video Foley generation compared to naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, we propose a metric leveraging a text-audio co-embedding space to measure both target-prompt evidence and residual visually implied source leakage. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/
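Abstract (reference translation): This work studies counterfactual video Foley generation, which aims to adopt a sound-source identity that contradicts the visual evidence while staying temporally synchronized with the video. Existing Video&Text-to-Audio (VT2A) models often remain anchored to the visually implied sound source when the video and text contents disagree. CounterFlow is an inference-time dual-phase sampling scheme for pretrained flow-matching VT2A models. Phase 1 builds a video-derived temporal structure while suppressing the visually implied source, and Phase 2 drops video conditioning to concentrate on shaping audio timbre toward the target prompt. CounterFlow substantially improves counterfactual video Foley generation compared with naive negative prompting and state-of-the-art baselines. To evaluate replacement quality, the authors propose a metric that uses a text-audio co-embedding space to measure both target-prompt evidence and residual leakage of the visually implied source. Video demonstrations and code are available at https://gyubin-lee.github.io/counterflow-demo/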
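
Illustrative sketch (not the authors' code): the two-phase schedule described in the abstract can be pictured as an Euler integration of a flow-matching velocity field in which the conditioning changes partway through sampling. Everything named here is an assumption made for illustration: `two_phase_sample`, `toy_velocity`, the `switch` fraction, the `w_neg` weight, and the string labels "target" and "source" are hypothetical. In particular, the abstract does not say how Phase 1 suppresses the visually implied source, so the guidance-style extrapolation below is a generic stand-in, not the paper's mechanism.

```python
import numpy as np


def toy_velocity(x, t, video, prompt):
    """Hypothetical stand-in for a pretrained VT2A velocity network."""
    goal = 1.0 if prompt == "target" else -1.0  # toy "timbre" attractor
    bias = 0.2 if video else 0.0                # toy effect of video conditioning
    return (goal + bias) - x


def two_phase_sample(velocity, x0, steps=20, switch=0.5, w_neg=1.0):
    """Euler-integrate dx/dt = v(x, t) from t=0 to t=1 in two phases.

    Phase 1 (first `switch` fraction of the steps): video-conditioned,
    steered away from a source-conditioned velocity (assumed mechanism).
    Phase 2 (remaining steps): video conditioning dropped, target prompt only.
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / steps
    switch_step = int(round(switch * steps))
    for i in range(steps):
        t = i * dt
        if i < switch_step:
            v_pos = velocity(x, t, video=True, prompt="target")
            v_neg = velocity(x, t, video=True, prompt="source")
            # Assumed suppression: extrapolate away from the source velocity.
            v = v_pos + w_neg * (v_pos - v_neg)
        else:
            v = velocity(x, t, video=False, prompt="target")
        x = x + dt * v
    return x


if __name__ == "__main__":
    sample = two_phase_sample(toy_velocity, np.zeros(4), steps=10)
    print(sample)
```

In a real setting `velocity` would wrap a pretrained flow-matching VT2A model, and the switch point and suppression strength would need tuning; the sketch only shows where the conditioning changes between phases.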


Related papers