Fugu-MT Paper Translation (Abstract): AHELM: A Holistic Evaluation of Audio-Language Models

Paper overview: AHELM: A Holistic Evaluation of Audio-Language Models

arxiv url: http://arxiv.org/abs/2508.21376v2
Date: Tue, 02 Sep 2025 17:58:21 GMT
Status: Translation complete
Last updated in system: 2025-09-04 11:03:28.034076
Title: AHELM: A Holistic Evaluation of Audio-Language Models
Title (reference translation): AHELM: A Holistic Evaluation of Audio-Language Models
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
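Abstract summary: Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. It also standardizes the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
Reference score (independently calculated attention level): 78.20477815156484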
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluations of audio-language models (ALMs) -- multimodal models that take interleaved audio and text as input and output text -- are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets -- including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering -- to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark and new datasets and models will be added over time.
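Abstract (reference translation): Evaluations of audio-language models (ALMs), multimodal models that take interleaved audio and text as input and output text, are hindered by the lack of standardized benchmarks. Furthermore, comparison across models is difficult because separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets, including two new synthetic audio-text datasets: PARADE, which evaluates ALMs on avoiding stereotypes, and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers, plus 3 simple baseline systems, each consisting of an automatic speech recognizer and a language model. The results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness ($p=0.01$) on ASR tasks, whereas most of the other models do not. The baseline systems also perform reasonably well on AHELM, with one ranking 6th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available at https://crfm.stanford.edu/helm/audio/v1.0.0. AHELM is intended to be a living benchmark, and new datasets and models will be added over time.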


Related papers