Fugu-MT Paper Translation (Summary): Seeing Voices: Generating A-Roll Video from Audio with Mirage

Paper summary: Seeing Voices: Generating A-Roll Video from Audio with Mirage

arxiv url: http://arxiv.org/abs/2506.08279v1
Date: Mon, 09 Jun 2025 22:56:02 GMT
Status: Translation complete
System update date: 2025-06-11 15:11:40.867271
Title: Seeing Voices: Generating A-Roll Video from Audio with Mirage
Title (reference translation): Seeing Voices: Generating A-Roll Video from Audio with Mirage
Authors: Aditi Sundararaman, Amogh Adishesha, Andrew Jaegle, Dan Bigioi, Hyoung-Kyu Song, Jon Kyl, Justin Mao, Kevin Lan, Mojtaba Komeili, ShahRukh Athar, Sheila Babayan, Stanislau Beliasau, William Buchwalter
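Abstract summary: Current approaches to video generation ignore sound and focus on general-purpose but silent image sequence generation. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input.
Reference score (independently computed attention score): 12.16029287095035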
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video's audio track) with what we see (the video's image sequence). Current approaches to video generation either ignore sound to focus on general-purpose but silent image sequence generation or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing methods for speech synthesis (text-to-speech, or TTS), Mirage results in compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or given existing weights. This methodology allows Mirage to retain generality as an approach to audio-to-video generation while producing outputs of superior subjective quality to methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see paper and comments for links).
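Abstract (reference translation): From professional filmmaking to user-generated content, creators and consumers have long recognized that the power of video depends on the harmonious integration of what we hear (the video's audio track) with what we see (the video's image sequence). Current approaches to video generation either ignore sound and focus on general-purpose but silent image sequence generation, or address both visual and audio elements but focus on restricted application domains such as re-dubbing. We introduce Mirage, an audio-to-video foundation model that excels at generating realistic, expressive output imagery from scratch given an audio input. When integrated with existing speech synthesis methods (text-to-speech, or TTS), Mirage yields compelling multimodal video. When trained on audio-video footage of people talking (A-roll) and conditioned on audio containing speech, Mirage generates video of people delivering a believable interpretation of the performance implicit in the input audio. Our central technical contribution is a unified method for training self-attention-based audio-to-video generation models, either from scratch or from existing weights. This methodology lets Mirage remain general as an approach to audio-to-video generation while producing outputs of higher subjective quality than methods that incorporate audio-specific architectures or loss components specific to people, speech, or details of how images or audio are captured. We encourage readers to watch and listen to the results of Mirage for themselves (see the paper and comments for links).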

Related papers