Fugu-MT Paper Translation (Abstract): AVControl: Efficient Framework for Training Audio-Visual Controls

Paper summary: AVControl: Efficient Framework for Training Audio-Visual Controls

arxiv url: http://arxiv.org/abs/2603.24793v1
Date: Wed, 25 Mar 2026 20:06:43 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:47.980728
Title: AVControl: Efficient Framework for Training Audio-Visual Controls
Title (reference translation): AVControl: A Framework for Efficiently Training Audio-Visual Controls
Authors: Matan Ben-Yosef, Tavi Halperin, Naomi Ken Korem, Mohammad Salama, Harel Cain, Asaf Joseph, Anthony Chen, Urska Jelercic, Ofir Bibi
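Abstract summary: AVControl is a lightweight, extendable framework built on LTX-2 for controlling video and audio generation. It supports a diverse set of independently trained modalities, including depth, pose, edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives.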
Reference score (independently computed attention metric): 4.840804297125223
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
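Abstract (reference translation): Controlling video and audio generation requires diverse modalities, such as depth, pose, camera trajectories, and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. AVControl is a lightweight, extendable framework built on LTX-2 in which each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers. We show that simply extending image-based in-context methods to video fails for structural control, and that the parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.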
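
The abstract describes two ingredients: a LoRA adapter per control modality, and a "parallel canvas" whose reference tokens are made visible to the attention layers alongside the generated tokens. The following is a minimal, dependency-free sketch of that idea, not the authors' implementation. The names `LoRALinear` and `canvas_attention`, the single-head attention, the token dimensions, and the zero-initialised up-projection are all assumptions made for illustration; positional encoding, the LTX-2 architecture, and training are omitted entirely.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng, std=0.1):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

class LoRALinear:
    """Hypothetical layer: frozen weight W plus a low-rank update (alpha / r) * A @ B."""

    def __init__(self, dim, rank, rng, alpha=1.0):
        self.w = rand_matrix(dim, dim, rng)            # frozen base weight
        self.a = rand_matrix(dim, rank, rng)           # trainable down-projection
        self.b = [[0.0] * dim for _ in range(rank)]    # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        base = matmul(x, self.w)
        delta = scale(matmul(matmul(x, self.a), self.b), self.scale)
        return add(base, delta)

def canvas_attention(target, reference, q_proj, k_proj, v_proj):
    """Single-head self-attention over [target; reference]; returns target rows only."""
    tokens = target + reference                        # concatenate along the sequence axis
    q, k, v = q_proj(tokens), k_proj(tokens), v_proj(tokens)
    k_t = [list(col) for col in zip(*k)]
    scores = scale(matmul(q, k_t), 1.0 / math.sqrt(len(q[0])))
    weights = [softmax(row) for row in scores]
    out = matmul(weights, v)
    return out[:len(target)]                           # reference rows are discarded

rng = random.Random(0)
dim, rank = 8, 2
q_proj, k_proj, v_proj = (LoRALinear(dim, rank, rng) for _ in range(3))
target = rand_matrix(4, dim, rng, std=1.0)       # stand-in for tokens being generated
reference = rand_matrix(4, dim, rng, std=1.0)    # stand-in for control-signal tokens
out = canvas_attention(target, reference, q_proj, k_proj, v_proj)
```

In this sketch the base weights are never modified, so a different control modality would correspond to a different set of `a` and `b` matrices on the same frozen `w`. Because `b` starts at zero, each adapted projection initially reproduces the frozen one exactly.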

Related papers