Fugu-MT Paper Translation (Abstract): Inference-Time Scaling for Joint Audio-Video Generation

Paper overview: Inference-Time Scaling for Joint Audio-Video Generation

arxiv url: http://arxiv.org/abs/2606.03183v1
Date: Tue, 02 Jun 2026 05:41:41 GMT
Status: Translation complete
System update date: 2026-06-03 22:00:04.783383
Title: Inference-Time Scaling for Joint Audio-Video Generation
Authors: Jaemin Jung, Kyeongha Rho, Inkyu Shin, Joon Son Chung
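Abstract summary: Joint audio-video generation models often require substantial training resources to improve fidelity. Inference-Time Scaling (ITS) is a promising training-free alternative in single-modality domains. This paper presents a comprehensive study of ITS for joint audio-video generation.
Reference score (attention level computed independently by this site): 38.09471807128537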
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Joint audio-video generation aims to synthesize realistic audio-video pairs that are both semantically aligned with text prompts and precisely synchronized. While existing joint audio-video generation models often require substantial training resources to improve fidelity, Inference-Time Scaling (ITS) has recently emerged as a promising training-free alternative in single-modality domains. However, extending ITS from a single modality to multimodal domains is non-trivial, as it requires balancing multiple heterogeneous objectives. In this paper, we present the first comprehensive study of ITS for joint audio-video generation. We first demonstrate that a multi-verifier framework is essential to address the limitations of single-objective guidance, including asymmetric performance trade-offs and verifier hacking. Through systematic analysis, we then identify an optimal multi-verifier combination that yields balanced improvements across all quality dimensions. Finally, to effectively aggregate diverse reward signals, we propose Adaptive Reward Weighting (ARW), a novel test-time optimization algorithm. ARW treats reward aggregation as an online optimization problem, utilizing learnable parameters to calibrate reward variances without requiring prior knowledge of reward distributions, thereby ensuring robust multi-objective selection. Experimental results on VGGSound and JavisBench-mini benchmarks demonstrate that our framework significantly enhances semantic alignment, perceptual quality, and audio-visual synchronization of generated outputs. Synthesized samples and code are available on the project page: https://jung-jaemin.github.io/ITS-AVGen-Proj.
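The abstract describes selecting among generated candidates using several verifiers whose reward scales differ. The following is a minimal sketch of that general idea only, not the paper's ARW algorithm: the abstract says ARW uses learnable parameters but gives no further detail, so this sketch swaps in plain online standardization (Welford's running mean and variance) per verifier, followed by best-of-N selection on the summed z-scores. The names `RunningStats` and `select_best`, and the example scores, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online estimate of mean and variance for one verifier."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self) -> float:
        # Fall back to 1.0 when the variance is undefined or zero.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / (self.count - 1)) or 1.0


def select_best(candidate_rewards, stats):
    """Pick the candidate with the highest sum of standardized rewards.

    candidate_rewards: one list per candidate, holding one raw score per verifier.
    stats: one RunningStats per verifier, updated in place (assumed design).
    """
    for rewards in candidate_rewards:
        for s, r in zip(stats, rewards):
            s.update(r)
    totals = [
        sum((r - s.mean) / s.std() for s, r in zip(stats, rewards))
        for rewards in candidate_rewards
    ]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Hypothetical scores: verifier A lies in [0, 1], verifier B in [0, 100].
rewards = [[0.9, 10.0], [0.5, 50.0], [0.1, 60.0]]
best, totals = select_best(rewards, [RunningStats(), RunningStats()])
naive_best = max(range(len(rewards)), key=lambda i: sum(rewards[i]))
```

With these made-up scores, a raw sum is dominated by the larger-scale verifier and picks index 2, while the standardized sum picks the more balanced candidate at index 1. This illustrates why reward variances need calibrating before aggregation; how ARW actually does so is described in the paper, not here.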

Related papers