Fugu-MT Paper Translation (Abstract): OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models

Paper summary: OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models

arxiv url: http://arxiv.org/abs/2603.23938v1
Date: Wed, 25 Mar 2026 05:00:51 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.136753
Title: OmniACBench: A Benchmark for Evaluating Context-Grounded Acoustic Control in Omni-Modal Models
Title (reference translation): OmniACBench: A Benchmark for Evaluating Acoustic Control in Omni-Modal Models
Authors: Seunghee Kim, Bumkyu Park, Kyudan Jung, Joosung Lee, Soyoon Kim, Jeonghoon Kim, Taeuk Kim, Hwiyeol Jo
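Abstract summary: OmniACBench is a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. Experiments on eight models reveal limitations in the proposed setting, despite strong performance on textual-output evaluations.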
Reference score (independently computed attention level): 17.817469065260124
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most testbeds for omni-modal models assess multimodal understanding via textual outputs, leaving it unclear whether these models can properly speak their answers. To study this, we introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench comprises 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite their strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities, but in integrating multimodal context for faithful speech generation. Moreover, we identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), providing insights for developing models that can verbalize responses effectively.
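Abstract (reference translation): Most testbeds for omni-modal models evaluate multimodal understanding through textual outputs, so it remains unclear whether these models can properly speak their answers. We therefore introduce OmniACBench, a benchmark for evaluating context-grounded acoustic control in omni-modal models. Given a spoken instruction, a text script, and an image, a model must read the script aloud with an appropriate tone and manner. OmniACBench consists of 3,559 verified instances covering six acoustic features: speech rate, phonation, pronunciation, emotion, global accent, and timbre. Extensive experiments on eight models reveal their limitations in the proposed setting, despite strong performance on prior textual-output evaluations. Our analyses show that the main bottleneck lies not in processing individual modalities but in integrating multimodal context for faithful speech generation. We also identify three common failure modes (weak direct control, failed implicit inference, and failed multimodal grounding), which offer insights for developing models that can verbalize responses effectively.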


Related papers