Fugu-MT Paper Translation (Summary): ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Paper summary: ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

arxiv url: http://arxiv.org/abs/2604.15086v1
Date: Thu, 16 Apr 2026 14:47:24 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.954646
Title: ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling
Authors: Jianxuan Yang, Xinyue Guo, Zhi Cheng, Kai Wang, Lipan Zhang, Jinjie Hu, Qiang Ji, Yihua Cao, Yihao Meng, Zhaoyue Cui, Mengmei Liu, Meng Meng, Jian Luan
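Abstract summary: Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content. Existing methods suffer from weak textual controllability under visual-text conflict and from imprecise stylistic control. The authors propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio.
Reference score (attention score computed independently by this site): 27.798767691628825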
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.

