Fugu-MT Paper Translation (Summary): Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Paper summary: Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

arxiv url: http://arxiv.org/abs/2604.10708v1
Date: Sun, 12 Apr 2026 16:08:20 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.184191
Title: Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Title (reference translation): Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
Authors: Zeyue Tian, Binxin Yang, Zhaoyang Liu, Jiexuan Zhang, Ruibin Yuan, Hubery Yin, Qifeng Chen, Chen Li, Jing Lv, Wei Xue, Yike Guo
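Abstract summary: Audio-Omni is the first end-to-end framework to unify generation and editing across the general sound, music, and speech domains. It combines a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. AudioEdit is a large-scale dataset of over one million meticulously curated editing pairs.
Reference score (independently computed attention score): 63.573256490583724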
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three tasks underexplored. While some pioneering works have explored unifying audio understanding and generation, they often remain confined to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across general sound, music, and speech domains, with integrated multi-modal understanding capabilities. Our architecture synergizes a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a new large-scale dataset comprising over one million meticulously curated editing pairs. Extensive experiments demonstrate that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while achieving performance on par with or superior to specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released on https://zeyuet.github.io/Audio-Omni.
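Abstract (reference translation): Recent progress in multimodal models has brought rapid advances in audio understanding, generation, and editing. However, these capabilities are typically handled by specialized models, and the development of a truly unified framework that can seamlessly integrate all three tasks remains underexplored. Some pioneering works have explored unifying audio understanding and generation, but they are often limited to specific domains. To address this, we introduce Audio-Omni, the first end-to-end framework to unify generation and editing across the general sound, music, and speech domains. It combines a frozen Multimodal Large Language Model for high-level reasoning with a trainable Diffusion Transformer for high-fidelity synthesis. To overcome the critical data scarcity in audio editing, we construct AudioEdit, a large-scale dataset of over one million meticulously curated editing pairs. Extensive experiments show that Audio-Omni achieves state-of-the-art performance across a suite of benchmarks, outperforming prior unified approaches while performing on par with or better than specialized expert models. Beyond its core capabilities, Audio-Omni exhibits remarkable inherited capabilities, including knowledge-augmented reasoning generation, in-context generation, and zero-shot cross-lingual control for audio generation, highlighting a promising direction toward universal generative audio intelligence. The code, model, and dataset will be publicly released at https://zeyuet.github.io/Audio-Omni.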

Related papers