Fugu-MT Paper Translation (Abstract): MMAE: A Massive Multitask Audio Editing Benchmark

Paper summary: MMAE: A Massive Multitask Audio Editing Benchmark

arxiv url: http://arxiv.org/abs/2606.07229v1
Date: Fri, 05 Jun 2026 12:52:41 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.741049
Title: MMAE: A Massive Multitask Audio Editing Benchmark
Title (reference translation): MMAE: A Multitask Audio Editing Benchmark
Authors: Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen
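Abstract summary: MMAE is the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. It comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. The evaluation shows that existing systems remain far from reliable editing.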
Reference score (independently computed attention score): 66.74229858407413
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
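Abstract (reference translation): This paper introduces MMAE, a multitask audio editing benchmark. Driven by the shift toward intelligent creation, interactive editing has rapidly expanded from the visual domain, led by models such as Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags far behind: it is highly fragmented and limited to specific subdomains or basic operations. Unlike existing benchmarks of limited scope, MMAE extends to a broad range of real-world scenarios, covering 7 distinct audio modalities, including sound, speech, music, and their mixtures. In addition, it establishes a comprehensive taxonomy spanning 6 levels of task complexity (from basic modifications to multi-hop reasoning and multi-round editing), 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Extensive evaluation of leading models shows that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) stays below 5% and drops to an absolute 0% on complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. The authors hope that MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.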
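
The abstract describes decomposing each task into verifiable criteria (17,741 criteria over 2,000 samples, about 8.9 per sample on average) and reports an Exact Match Rate (EMR), but it does not give the formula. The following is a minimal Python sketch of how such a rubric score could be aggregated, under the assumption that a sample counts as an exact match only when every one of its criteria passes. The `RubricResult` structure, the function names, and the toy data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Pass/fail verdicts for one edited sample (hypothetical structure)."""
    sample_id: str
    criteria_passed: List[bool]  # one verdict per verifiable criterion


def criterion_pass_rate(results: List[RubricResult]) -> float:
    """Fraction of individual criteria that passed, pooled over all samples."""
    total = sum(len(r.criteria_passed) for r in results)
    passed = sum(sum(r.criteria_passed) for r in results)
    return passed / total if total else 0.0


def exact_match_rate(results: List[RubricResult]) -> float:
    """Fraction of samples whose criteria ALL passed.

    This all-or-nothing definition is an assumption; the abstract
    does not state how EMR is computed.
    """
    if not results:
        return 0.0
    exact = sum(
        1 for r in results if r.criteria_passed and all(r.criteria_passed)
    )
    return exact / len(results)


# Toy data: 4 samples, 12 criteria in total, 10 of them passed.
results = [
    RubricResult("s1", [True, True, True]),
    RubricResult("s2", [True, False, True, True]),
    RubricResult("s3", [True, True, False]),
    RubricResult("s4", [True, True]),
]

print(f"criterion pass rate: {criterion_pass_rate(results):.3f}")
print(f"exact match rate:    {exact_match_rate(results):.3f}")
```

Under this assumed definition, a system can pass most individual criteria while still scoring a low EMR, since a single failed criterion disqualifies the whole sample. That would be consistent with the gap the abstract reports, though the paper's exact metric may differ.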

Related papers