Fugu-MT Paper Translation (Abstract): OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

Paper overview: OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text

arxiv url: http://arxiv.org/abs/2604.04348v1
Date: Mon, 06 Apr 2026 01:43:00 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.058063
Title: OmniSonic: Towards Universal and Holistic Audio Generation from Video and Text
Title (reference translation): OmniSonic: Toward Universal and Holistic Audio Generation from Video and Text
Authors: Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen, Shijian Deng, Kai Wang, Yunhui Guo, Yapeng Tian
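Abstract summary: The paper proposes Universal Holistic Audio Generation (UniHAGen), a task of generating comprehensive auditory scenes that include both on-screen and off-screen sounds. It introduces OmniSonic, a flow-matching-based diffusion framework conditioned on video and text.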
Reference score (independently computed attention score): 46.65856772563035
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, they are limited to non-speech sounds and cannot generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
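Abstract (reference translation): This paper proposes Universal Holistic Audio Generation (UniHAGen), a task of synthesizing comprehensive auditory scenes that contain both on-screen and off-screen sounds across a range of domains (for example, ambient events, musical instruments, and human speech). Earlier video-conditioned audio generation models usually concentrated on on-screen environmental sounds tied to visible sounding events and ignored off-screen auditory events. Recent holistic joint text-video-to-audio generation models do aim to produce auditory scenes with both on-screen and off-screen sound, but they are restricted to non-speech sounds and cannot generate or integrate human speech. To address these limitations, the authors introduce OmniSonic, a flow-matching-based diffusion framework conditioned jointly on video and text. It uses a TriAttn-DiT architecture that applies three cross-attention operations to handle on-screen environmental sound, off-screen environmental sound, and speech conditions at the same time, together with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. The authors also build UniHAGen-Bench, a new benchmark of more than one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/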
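
The abstract describes TriAttn-DiT only at a high level: three cross-attention operations (on-screen environmental sound, off-screen environmental sound, speech) whose contributions are balanced by an MoE gate. The following is a minimal sketch of that general idea, not the authors' implementation. Everything beyond "three cross-attention branches plus a gate" is an assumption made for illustration: the function names, the absence of learned query/key/value projections, the dense per-token softmax gate, and the residual connection are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Assumption: no learned projections; context vectors serve as both
    keys and values, and share the query dimension.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in context])
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(dim)])
    return out


def gated_tri_attention(latents, on_screen, off_screen, speech, gate_weights):
    """Fuse three cross-attention branches with a per-token softmax gate.

    gate_weights is a hypothetical 3 x dim matrix: one linear scoring
    vector per branch. Returns the fused tokens and the gate values.
    """
    branches = [cross_attention(latents, ctx)
                for ctx in (on_screen, off_screen, speech)]
    fused, all_gates = [], []
    for i, token in enumerate(latents):
        gates = softmax([dot(w, token) for w in gate_weights])
        # Assumed residual connection: token plus gated mix of the branches.
        fused.append([token[d] + sum(g * b[i][d] for g, b in zip(gates, branches))
                      for d in range(len(token))])
        all_gates.append(gates)
    return fused, all_gates


# Toy inputs: two audio latent tokens of dimension 2, three condition sets.
latents = [[1.0, 0.0], [0.0, 1.0]]
on_screen = [[1.0, 1.0]]
off_screen = [[2.0, 0.0], [0.0, 2.0]]
speech = [[0.5, -0.5]]
gate_weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # zero weights: uniform gate

fused, gates = gated_tri_attention(latents, on_screen, off_screen, speech,
                                   gate_weights)
```

With all-zero gate weights the gate is uniform, so each branch contributes one third. In a trained model the gate would presumably depend on the token and possibly on the conditions, letting, say, the speech branch dominate where speech is present; how the paper actually parameterizes this is not stated in the abstract.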


Related papers