Fugu-MT paper translation (abstract): Diffusion Models for Joint Audio-Video Generation

Paper abstract: Diffusion Models for Joint Audio-Video Generation

arxiv url: http://arxiv.org/abs/2603.16093v1
Date: Tue, 17 Mar 2026 03:31:37 GMT
Status: translation complete
Last updated in system: 2026-03-18 17:42:07.088837
Title: Diffusion Models for Joint Audio-Video Generation
Title (reference translation): Diffusion Models for Joint Audio-Video Generation
Authors: Alejandro Paredes La Torre
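Abstract summary: Two high-quality, paired audio-video datasets are released. The MM-Diffusion architecture is trained from scratch on these datasets. A sequential two-step text-to-audio-video generation pipeline is proposed.
Reference score (independently computed attention measure): 51.56484100374058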
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I make four contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consist of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity audio-video generations.
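Abstract (reference translation): Multimodal generative models have made remarkable progress in single-modality video and audio synthesis, but truly joint audio-video generation remains an unsolved problem. This paper examines four key contributions to advance the field. First, I release two high-quality paired audio-video datasets. The datasets, consisting of 13 hours of video-game clips and 64 hours of concert performances, are each split into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on the datasets, showing that it can generate semantically coherent audio-video pairs, and quantitatively evaluate alignment on rapid actions and musical cues. Third, I study joint latent diffusion using pretrained video and audio encoder-decoders, revealing challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a two-step text-to-audio-video generation pipeline that first generates video and then conditions on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach produces high-fidelity audio-video generation.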
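
The abstract gives only the total footage per dataset and the clip length, not the number of samples. As a rough illustration, the sketch below computes an upper bound on the number of complete 34-second clips. It assumes non-overlapping clips cut from one continuous stream with any trailing remainder dropped; the abstract does not state this, and the real counts depend on how the source recordings were split. The function name `max_full_clips` is hypothetical.

```python
def max_full_clips(total_hours: float, clip_seconds: int = 34) -> int:
    """Upper bound on complete, non-overlapping clips in the given footage.

    Assumption (not from the abstract): the footage is one continuous
    stream and any trailing partial clip is discarded.
    """
    total_seconds = int(total_hours * 3600)
    return total_seconds // clip_seconds


if __name__ == "__main__":
    # Hours are taken from the abstract; labels are illustrative.
    for name, hours in (("video-game clips", 13), ("concert performances", 64)):
        print(f"{name}: at most {max_full_clips(hours)} clips of 34 s")
```

Under these assumptions the bound is 1376 clips for the video-game dataset and 6776 for the concert dataset. The true numbers may be lower if clips were cut per recording.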
