Fugu-MT Paper Translation (Abstract): ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

Paper overview: ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

arxiv url: http://arxiv.org/abs/2605.03361v1
Date: Tue, 05 May 2026 04:44:51 GMT
Status: Translation complete
Last updated in system: 2026-05-06 19:35:43.770103
Title: ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
Title (reference translation): ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval
Authors: Honglei Zhang, Yuting Chen, Chenpeng Hu, Siyue Zhang, Yilei Shi,
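Abstract summary: ReasonAudio is the first reasoning-intensive benchmark for Text-Audio Retrieval. It comprises 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. An evaluation of ten state-of-the-art models shows that all models struggle with reasoning-intensive audio retrieval.
Reference score (independently computed attention score): 9.400944614656735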
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: As multimodal content continues to expand at a rapid pace, audio retrieval has emerged as a key enabling technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and fail to capture the fact that real-world queries often demand advanced reasoning abilities, including negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Despite their intuitive nature for humans and straightforward construction, these tasks pose significant challenges to current models. Our evaluation of ten state-of-the-art models reveals the following findings: All models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Moreover, Multimodal Large Language Model-based embedding models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings.
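Abstract (reference translation): Multimodal content continues to expand rapidly, and audio retrieval has emerged as a key technology for media search, content organization, and intelligent assistants. However, most existing benchmarks concentrate on semantic matching and do not capture the fact that real-world queries often require advanced reasoning abilities such as negation understanding, temporal ordering, concurrent event recognition, and duration discrimination. To address this gap, we introduce ReasonAudio, the first reasoning-intensive benchmark for Text-Audio Retrieval, comprising 1,000 queries and 10,000 composite audio clips across five fundamental reasoning tasks: Negation, Order, Overlap, Duration, and Mix. Although these tasks are intuitive for humans and simple to construct, they pose significant challenges to current models. Our evaluation of ten state-of-the-art models shows that all models struggle with reasoning-intensive audio retrieval, performing particularly poorly on Negation and Duration while showing relatively better results on Overlap and Order. Moreover, embedding models based on Multimodal Large Language Models fail to inherit the reasoning capabilities of their backbones through contrastive fine-tuning, suggesting that current training paradigms are insufficient to preserve reasoning capacity in retrieval settings.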
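
To make the distinction between matching and reasoning concrete, the following is a minimal sketch of four of the task types named in the abstract (Negation, Order, Overlap, Duration) as predicates over a toy event timeline. The `Event` representation, the function names, and the keyword-counting baseline are illustrative assumptions, not the paper's data format, construction procedure, or evaluation method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical annotation of one sound event in a composite clip."""
    label: str
    start: float  # seconds
    end: float    # seconds


def labels(clip):
    return {e.label for e in clip}


def first(clip, label):
    """Earliest event with this label, or None if the label is absent."""
    matches = [e for e in clip if e.label == label]
    return min(matches, key=lambda e: e.start) if matches else None


def negation(clip, present, absent):
    # "<present> without <absent>"
    return present in labels(clip) and absent not in labels(clip)


def order(clip, a, b):
    # "<a> finishes before <b> starts"
    ea, eb = first(clip, a), first(clip, b)
    return ea is not None and eb is not None and ea.end <= eb.start


def overlap(clip, a, b):
    # "<a> and <b> sound at the same time"
    ea, eb = first(clip, a), first(clip, b)
    return (ea is not None and eb is not None
            and ea.start < eb.end and eb.start < ea.end)


def longer(clip, a, b):
    # "<a> lasts longer than <b>"
    ea, eb = first(clip, a), first(clip, b)
    return (ea is not None and eb is not None
            and (ea.end - ea.start) > (eb.end - eb.start))


def keyword_score(clip, query_labels):
    # Matching-only baseline (assumption): count query labels found in the clip.
    return len(labels(clip) & set(query_labels))


# Three toy clips built from the same two sound classes.
clip_a = [Event("dog", 0.0, 2.0), Event("siren", 3.0, 8.0)]
clip_b = [Event("siren", 0.0, 2.0), Event("dog", 1.0, 6.0)]
clip_c = [Event("dog", 0.0, 4.0)]
```

In this toy setup, `keyword_score` gives `clip_a` and `clip_b` the same score for a query mentioning "dog" and "siren", although only `clip_a` satisfies `order(clip, "dog", "siren")` and only `clip_b` satisfies `overlap(clip, "dog", "siren")`. For a query like "dog without siren", the baseline even prefers `clip_a` over the correct `clip_c`, because it rewards the very label the query excludes.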
