Fugu-MT Paper Translation (Summary): V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Paper summary: V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

arxiv url: http://arxiv.org/abs/2603.11042v1
Date: Wed, 11 Mar 2026 17:59:40 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:33.097233
Title: V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Authors: Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan
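Abstract summary: V2M-Zero is a zero-pair video-to-music generation approach that outputs time-aligned music for video. The method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes.
Reference score (independently computed attention score): 35.44526708016307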
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
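The abstract describes event curves computed from intra-modal similarity but does not specify the encoders, the similarity measure, or any normalization. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes per-frame embeddings from some pretrained encoder are already available, and it measures change as one minus the cosine similarity between consecutive frames. The function name `event_curve` and the min-max normalization are illustrative assumptions.

```python
import numpy as np


def event_curve(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical event curve: per-frame change within a single modality.

    `embeddings` has shape (num_frames, dim) and is assumed to come from a
    pretrained music or video encoder (not specified in the abstract).
    """
    emb = np.asarray(embeddings, dtype=np.float64)

    # Normalize each frame embedding to unit length (guarding against zeros).
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norms, eps)

    # Cosine similarity between each frame and the previous one.
    sim = np.sum(unit[1:] * unit[:-1], axis=1)

    # Assumed change measure: low similarity means a large change.
    change = 1.0 - sim

    # Pad the first frame with zero so the curve has one value per frame.
    curve = np.concatenate([[0.0], change])

    # Min-max normalize (an assumption) so curves from different encoders
    # share a common [0, 1] range.
    span = curve.max() - curve.min()
    if span <= eps:
        return np.zeros_like(curve)
    return (curve - curve.min()) / span


# Toy embeddings with a single abrupt change at frame 3.
toy_embeddings = np.array(
    [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float
)
toy_curve = event_curve(toy_embeddings)
print(toy_curve)
```

Because the curve depends only on how much consecutive frames differ, not on what the embeddings encode, the same function can be applied to music embeddings during fine-tuning and to video embeddings at inference, which is the substitution the abstract describes.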

Related papers