Fugu-MT Paper Translation (Abstract): EvA: An Evidence-First Audio Understanding Paradigm for LALMs

Paper summary: EvA: An Evidence-First Audio Understanding Paradigm for LALMs

arxiv url: http://arxiv.org/abs/2603.27667v1
Date: Sun, 29 Mar 2026 12:32:04 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.064419
Title: EvA: An Evidence-First Audio Understanding Paradigm for LALMs
Authors: Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang
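Abstract summary: EvA (Evidence-First Audio) is a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA-Perception is a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits.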
Reference score (independently computed attention score): 32.05922674181507
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
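The fusion steps described in the abstract (aggregate intermediate CED layers, align the result to the Whisper timeline, then add the two streams without changing sequence length) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration: the function names, the uniform layer averaging, the linear interpolation used for alignment, and the premise that both streams already share a feature dimension. The abstract does not specify any of these details, so this is not the authors' implementation.

```python
# Hypothetical sketch of non-compressive, time-aligned fusion.
# Assumptions (not from the paper): uniform layer averaging, linear
# interpolation for alignment, and matching feature dimensions.

def aggregate_layers(layers, weights=None):
    """Weighted average of per-layer features, each shaped [T][D]."""
    n = len(layers)
    if weights is None:
        weights = [1.0 / n] * n  # assumed: equal weight per layer
    num_frames, dim = len(layers[0]), len(layers[0][0])
    return [
        [sum(w * layer[t][d] for w, layer in zip(weights, layers))
         for d in range(dim)]
        for t in range(num_frames)
    ]


def align_to_timeline(feats, target_len):
    """Linearly interpolate [T_src][D] features to target_len frames."""
    src_len = len(feats)
    if src_len == 1 or target_len == 1:
        # Degenerate case: repeat the first frame.
        return [list(feats[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (src_len - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(feats[lo], feats[hi])])
    return out


def fuse(whisper_feats, ced_layers):
    """Add aligned, aggregated CED features to the Whisper stream.

    The output has exactly as many frames as whisper_feats, so the
    sequence length is unchanged (no pooling or token compression).
    """
    aggregated = aggregate_layers(ced_layers)
    aligned = align_to_timeline(aggregated, len(whisper_feats))
    return [[w + c for w, c in zip(w_frame, c_frame)]
            for w_frame, c_frame in zip(whisper_feats, aligned)]


# Toy example: 5 Whisper frames, two CED layers with 3 frames each.
whisper = [[0.0, 0.0] for _ in range(5)]
ced_layers = [
    [[1.0, 1.0], [3.0, 3.0], [5.0, 5.0]],
    [[3.0, 3.0], [5.0, 5.0], [7.0, 7.0]],
]
fused = fuse(whisper, ced_layers)
```

In this toy example the fused output keeps the five Whisper frames, which is the "non-compressive" property the abstract emphasizes. A real system would likely need a learned projection to match feature dimensions and possibly learned layer weights; both are omitted here.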
