Fugu-MT paper translation (summary): STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

Paper summary: STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

arxiv url: http://arxiv.org/abs/2510.24693v1
Date: Tue, 28 Oct 2025 17:50:34 GMT
Status: Translation complete
System update date: 2025-10-29 17:50:20.198079
Title: STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Title (reference translation): STAR-Bench: Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Authors: Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang
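Abstract summary: We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space. STAR-Bench combines a Foundational Acoustic Perception setting with a Holistic Spatio-Temporal Reasoning setting. The data curation pipeline uses two methods to ensure high-quality samples.
Reference score (independently computed attention score): 81.94084852268468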
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
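Abstract (reference translation): Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks mainly test semantics that can be recovered from text captions, which masks deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes, as well as spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. The data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, it uses procedurally synthesized and physics-simulated audio. For holistic data, it follows a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), reflecting its focus on cues that are hard to describe in language. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.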


Related papers