Fugu-MT Paper Translation (Abstract): ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

Paper overview: ORCA: Open-ended Response Correctness Assessment for Audio Question Answering

arxiv url: http://arxiv.org/abs/2512.09066v1
Date: Fri, 28 Nov 2025 14:41:48 GMT
Status: Translation complete
Last updated in system: 2025-12-15 04:16:52.578237
Title: ORCA: Open-ended Response Correctness Assessment for Audio Question Answering
Authors: Šimon Sedláček, Sara Barahona, Bolaji Yusuf, Laura Herrera-Alarcón, Santosh Kesiraju, Cecilia Bolaños, Alicia Lozano-Diez, Sathvik Udupa, Fernando López, Allison Ferner, Ramani Duraiswami, Jan Černocký
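Abstract summary: This work proposes ORCA, a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. The authors collected 11,721 annotations from 15 LALMs, achieving inter-annotator agreement of 0.82 (Krippendorff's alpha).
Reference score (attention score computed independently by this site): 41.72231074041232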
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating open-ended responses from large audio language models (LALMs) is challenging because human annotators often genuinely disagree on answer correctness due to multiple valid interpretations, partial correctness, and subjective judgment. Traditional metrics reporting only mean scores fail to capture this uncertainty. We present ORCA (Open-ended Response Correctness Assessment), a framework that models the variability in human judgments using Beta distributions to predict both expected correctness and uncertainty. Our three-stage annotation framework combines human judgment with structured feedback and iterative refinement to simultaneously curate training data and improve benchmark quality. We collected 11,721 annotations across 3,580 question-answer pairs from 15 LALMs on two audio QA benchmarks, achieving inter-annotator agreement of 0.82 (Krippendorff's alpha). ORCA achieves 0.91 Spearman correlation with mean human judgments, matching or outperforming LLM-judge baselines while providing uncertainty estimates and requiring significantly less compute. We release our models, code, and curated dataset.
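As an illustration of why a Beta distribution can carry both an expected correctness and an uncertainty, the following is a minimal sketch. It is not the ORCA model, whose details the abstract does not give. It assumes binary correct/incorrect votes per response and a uniform Beta(1, 1) prior; the names `BetaSummary` and `beta_from_votes` are hypothetical and introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass
class BetaSummary:
    """Parameters of a Beta(alpha, beta) distribution over correctness."""
    alpha: float
    beta: float

    @property
    def mean(self) -> float:
        # Expected correctness: alpha / (alpha + beta)
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        # Spread of the distribution, read here as judgment uncertainty
        total = self.alpha + self.beta
        return (self.alpha * self.beta) / (total * total * (total + 1.0))


def beta_from_votes(correct: int, incorrect: int, prior: float = 1.0) -> BetaSummary:
    """Assumption: binary votes with a symmetric Beta(prior, prior) prior.

    The posterior after observing the votes is
    Beta(prior + correct, prior + incorrect).
    """
    if correct < 0 or incorrect < 0:
        raise ValueError("vote counts must be non-negative")
    return BetaSummary(prior + correct, prior + incorrect)


# Three annotators agree the answer is correct.
unanimous = beta_from_votes(correct=3, incorrect=0)
# Four annotators split evenly.
split = beta_from_votes(correct=2, incorrect=2)

print(f"unanimous: mean={unanimous.mean:.3f}, variance={unanimous.variance:.4f}")
print(f"split:     mean={split.mean:.3f}, variance={split.variance:.4f}")
```

Under these assumptions the evenly split response has a mean of 0.5 and a larger variance than the unanimous one, which is the kind of disagreement a single mean score would hide.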

Related papers