Fugu-MT Paper Translation (Abstract): MAEB: Massive Audio Embedding Benchmark

Paper summary: MAEB: Massive Audio Embedding Benchmark

arxiv url: http://arxiv.org/abs/2602.16008v1
Date: Tue, 17 Feb 2026 21:00:51 GMT
Status: Translation complete
Last updated in system: 2026-02-19 15:58:30.4363
Title: MAEB: Massive Audio Embedding Benchmark
Title (reference translation): MAEB: Massive Audio Embedding Benchmark
Authors: Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen
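Abstract summary: The Massive Audio Embedding Benchmark covers 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. An evaluation of 50+ models finds that no single model dominates across all tasks. Clustering remains challenging for all models, with even the best-performing model achieving only modest results.
Reference score (independently computed attention score): 13.002273534113113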
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
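Abstract (reference translation): We present the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark of 30 tasks spanning speech, music, environmental sounds, and cross-modal audio-text reasoning in more than 100 languages. Evaluating more than 50 models, we find that no single model dominates every task: contrastive audio-text models do well on environmental sound classification (e.g., ESC50) but score close to random on multilingual speech tasks (e.g., SIB-FLEURS), whereas speech-pretrained models show the reverse pattern. Clustering remains difficult for all models, and even the best-performing model achieves only modest results. Models that are strong at acoustic understanding often do poorly on linguistic tasks, and vice versa. The performance of audio encoders on MAEB also correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. It is designed to preserve task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. MAEB and all 98 tasks are released, together with code and a leaderboard, at https://github.com/embeddings-benchmark/mteb.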

Related papers