Fugu-MT Paper Translation (Abstract): EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Paper overview: EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

arxiv url: http://arxiv.org/abs/2512.24731v1
Date: Wed, 31 Dec 2025 08:58:30 GMT
Status: Translation complete
Last updated in system: 2026-01-01 23:27:28.613826
Title: EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
Title (reference translation): EchoFoley: Event-Centric Hierarchical Control for Video-Grounded Creative Sound Generation
Authors: Bingxuan Li, Yiming Cui, Yicheng He, Yiwei Wang, Shu Zhang, Longyin Wen, Yulei Niu
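Abstract summary: This paper introduces EchoFoley, a task for video-grounded sound generation that combines event-level local control with hierarchical semantic control. A symbolic representation of sounding events specifies when, what, and how each sound is produced within a video or instruction. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
Reference score (independently computed attention score): 33.6858214966905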
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building upon this foundation, we propose EchoVidia, a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.
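Abstract (reference translation): Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations. To address these limitations, we introduce EchoFoley, a new task for video-grounded sound generation that combines event-level local control with hierarchical semantic control. The symbolic representation of sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls such as sound generation, insertion, and editing. To support this task, we construct EchoFoley-6k, a large-scale, expert-curated benchmark containing over 6,000 video-instruction-annotation triplets. Building on this foundation, EchoVidia is a sounding-event-centric agentic generation framework with a slow-fast thinking strategy. Experiments show that EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.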
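
The abstract describes a symbolic representation that records when, what, and how each sound is produced, and that supports generation, insertion, and editing. The sketch below is one way such a record might look. It is not the paper's schema: the class `SoundingEvent`, its fields `start_s`, `end_s`, `what`, and `how`, and the helpers `insert_event` and `edit_event` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SoundingEvent:
    """Hypothetical record for one sounding event (not the paper's schema)."""
    start_s: float  # when: onset in seconds (assumed unit)
    end_s: float    # when: offset in seconds (assumed unit)
    what: str       # what: the sound source or category
    how: str        # how: free-text description of the sound's character

    def __post_init__(self):
        # Reject events with a negative onset or a non-positive duration.
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("expected 0 <= start_s < end_s")


def insert_event(timeline, event):
    """Return a new timeline with `event` added, sorted by onset time."""
    return sorted([*timeline, event], key=lambda e: e.start_s)


def edit_event(timeline, index, **changes):
    """Return a new timeline where the event at `index` is an edited copy."""
    updated = list(timeline)
    updated[index] = replace(updated[index], **changes)
    return updated


if __name__ == "__main__":
    timeline = [SoundingEvent(2.0, 3.0, "door", "heavy slam")]
    timeline = insert_event(timeline, SoundingEvent(0.5, 1.5, "footsteps", "soft, on wood"))
    timeline = edit_event(timeline, 1, how="slow creak")
    for event in timeline:
        print(event)
```

Both helpers return new lists rather than mutating their input, so an edit to one event leaves the rest of the timeline untouched, which loosely mirrors the event-level local control the abstract mentions.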


Related papers