Fugu-MT paper translation (abstract): Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Paper overview: Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

arxiv url: http://arxiv.org/abs/2605.17488v1
Date: Sun, 17 May 2026 14:56:52 GMT
Status: translation complete
Last updated in system: 2026-05-19 17:57:48.117881
Title: Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
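Title (reference translation): Omni-Customizer: End-to-End Multimodal Customization for Joint Audio-Video Generation
Authors: Yuheng Chen, Qingdong He, Teng Hu, Yuji Wang, Yabiao Wang, Lizhuang Ma, Jiangning Zhang
Abstract summary: We propose Omni-Customizer, an end-to-end framework for the precise binding and seamless fusion of multimodal identity information. Within this architecture, Semantic-Anchored Multimodal RoPE (SA-MRoPE) anchors visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions. Experiments show that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation.
Reference score (independently computed attention score): 93.44732526074876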
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.
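Abstract (reference translation): The landscape of joint audio and video generation has been fundamentally changed by the emergence of powerful foundation models. Despite this progress, cohesive multimodal customization that simultaneously preserves visual identities and vocal timbres across multiple interacting subjects remains largely unexplored. To fill this gap, we propose Omni-Customizer, an end-to-end framework aimed at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base text prompt with dense multimodal identity cues, together with a Masked TTS Cross-Attention (MTP-CA) mechanism designed to prevent the severe "speech leakage" problem. Within this architecture, we use Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. We further devise a comprehensive training strategy that incorporates interleaved audio-video scheduling, which rapidly adapts the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum that facilitates learning of high-level, robust identity features. Extensive experiments show that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling in visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.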
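
Illustrative sketch (not from the paper): the abstract names a Masked TTS Cross-Attention mechanism meant to prevent "speech leakage" but does not give its formulation. The Python below shows one plausible reading, in which each generation token may attend only to the TTS embeddings of its own subject. The per-subject mask, the function names `masked_cross_attention` and `subject_mask`, and the toy vectors are all assumptions made for illustration.

```python
import math


def subject_mask(query_subjects, key_subjects):
    """Hypothetical mask: a query may attend to a key only if both
    belong to the same subject (an assumption, not the paper's rule)."""
    return [[qs == ks for ks in key_subjects] for qs in query_subjects]


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product cross-attention with a boolean mask.

    queries: list of query vectors (stand-ins for audio-video tokens)
    keys, values: lists of vectors (stand-ins for TTS embeddings)
    mask[i][j] is True when query i may attend to key j.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Masked positions get -inf so they receive zero weight after softmax.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        if all(s == -math.inf for s in scores):
            # No allowed key: return a zero vector instead of dividing by zero.
            outputs.append([0.0] * out_dim)
            continue
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[d] for w, v in zip(weights, values))
            for d in range(out_dim)
        ])
    return outputs


# Toy example: two subjects, one query token and one TTS embedding each.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
mask = subject_mask([0, 1], [0, 1])
out = masked_cross_attention(queries, keys, values, mask)
```

With this mask, each token's output depends only on its own subject's value vector. With an all-True mask, every output would mix both value vectors, which is the kind of cross-subject mixing the term "speech leakage" suggests.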

Related papers