Fugu-MT Paper Translation (Abstract): BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

Paper overview: BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

arxiv url: http://arxiv.org/abs/2512.10403v1
Date: Thu, 11 Dec 2025 08:09:24 GMT
Status: Translation complete
Last updated in system: 2025-12-12 16:15:42.271761
Title: BRACE: A Benchmark for Robust Audio Caption Quality Evaluation
Authors: Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, Wentao Zhang
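Abstract summary: BRACE is a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. Using the BRACE benchmark, the authors test CLAPScore across various CLAP models and evaluate multiple LALMs.
Reference score (attention score computed by this site): 23.704921982469063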
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
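
Illustrative sketch (not from the paper): the abstract describes a reference-free metric choosing between captions and being scored with F1. The Python below shows one way such an evaluation could look, assuming the metric is a cosine similarity between an audio embedding and a caption embedding, which is how CLAPScore is commonly described. The embeddings are toy vectors rather than real CLAP outputs, the function names are hypothetical, and a plain binary F1 is an assumption, since the abstract does not state how BRACE computes or averages its F1-score.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def prefer_caption(audio_emb, caption_a_emb, caption_b_emb):
    """Return 0 if caption A scores at least as high as caption B, else 1.

    Assumption: the reference-free score is the audio-text cosine
    similarity; real embeddings would come from a CLAP-style model.
    """
    score_a = cosine_similarity(audio_emb, caption_a_emb)
    score_b = cosine_similarity(audio_emb, caption_b_emb)
    return 0 if score_a >= score_b else 1


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive label (assumed averaging scheme)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example: the audio embedding points along the first axis,
    # caption A is close to it and caption B is far from it.
    audio = [1.0, 0.0]
    caption_a = [0.9, 0.1]
    caption_b = [0.1, 0.9]
    print("preferred caption index:", prefer_caption(audio, caption_a, caption_b))

    # Hypothetical human labels vs. metric predictions over four pairs.
    human = [1, 0, 1, 1]
    metric = [1, 0, 0, 1]
    print("F1:", f1_score(human, metric))
```

In this setup, a higher F1 means the metric's preferences agree more often with the human labels; the benchmark's actual labeling and scoring protocol should be taken from the paper itself.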

Related papers