Fugu-MT Paper Translation (Abstract): OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

Paper overview: OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models

arxiv url: http://arxiv.org/abs/2509.26140v1
Date: Tue, 30 Sep 2025 11:57:47 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.525724
Title: OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
Title (reference translation): OWL: Geometry-Aware Spatial Reasoning for Audio Large Language Models
Authors: Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
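Abstract summary: We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns acoustic features with 3D spatial structure. We present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to rationalize over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA estimation.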
Reference score (independently computed attention score): 1.5599296461516985
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial reasoning is fundamental to auditory perception, yet current audio large language models (ALLMs) largely rely on unstructured binaural cues and single-step inference. This limits both perceptual accuracy in direction and distance estimation and the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but its reliance on coarse categorical labels (left, right, up, down) and the absence of explicit geometric supervision constrain resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room impulse responses at training time, while requiring only audio at inference. Building on this representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to rationalize over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$, a dataset of over one million QA pairs combining binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. Across two benchmark datasets, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves spatial reasoning QA accuracy by up to $\textbf{25}$\% over BAT.
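Abstract (reference translation): Spatial reasoning is fundamental to auditory perception, but current audio large language models (ALLMs) depend heavily on unstructured binaural cues and single-step inference. This limits perceptual accuracy in both direction and distance estimation, and it limits the capacity for interpretable reasoning. Recent work such as BAT demonstrates spatial QA with binaural audio, but it relies on coarse categorical labels (left, right, up, down) and lacks explicit geometric supervision, which constrains resolution and robustness. We introduce the $\textbf{Spatial-Acoustic Geometry Encoder (SAGE)}$, a geometry-aware audio encoder that aligns binaural acoustic features with 3D spatial structure using panoramic depth images and room impulse responses at training time, and that requires only audio at inference. Building on this representation, we present $\textbf{OWL}$, an ALLM that integrates $\textbf{SAGE}$ with a spatially grounded chain-of-thought to rationalize over direction-of-arrival (DoA) and distance estimates. Through curriculum learning from perceptual QA to multi-step reasoning, $\textbf{OWL}$ supports o'clock-level azimuth and DoA estimation. To enable large-scale training and evaluation, we construct and release $\textbf{BiDepth}$, a dataset of over one million QA pairs that combines binaural audio with panoramic depth images and room impulse responses across both in-room and out-of-room scenarios. On two benchmark datasets, our new $\textbf{BiDepth}$ and the public SpatialSoundQA, $\textbf{OWL}$ reduces mean DoA error by $\textbf{11}^{\circ}$ through $\textbf{SAGE}$ and improves spatial reasoning QA accuracy by up to $\textbf{25}$\% over BAT.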
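
Illustrative note: the abstract mentions o'clock-level azimuth and a mean DoA error in degrees but does not define either. The Python sketch below shows one plausible reading, not the authors' method. It rests on assumptions that are not in the source: azimuth is measured in degrees with 0 straight ahead (12 o'clock) and increasing clockwise, each clock hour covers a 30-degree sector centered on the hour, and the error is the wrapped absolute difference in azimuth only (elevation is ignored). The function names are hypothetical.

```python
import math


def azimuth_to_oclock(azimuth_deg: float) -> int:
    """Map an azimuth to a clock hour in 1..12.

    Assumed convention (not from the paper): 0 degrees = 12 o'clock,
    angles increase clockwise, each hour spans 30 degrees centered on the hour.
    """
    # Wrap into [0, 360), then round half up to the nearest 30-degree sector.
    sector = math.floor((azimuth_deg % 360.0) / 30.0 + 0.5) % 12
    return 12 if sector == 0 else sector


def angular_error_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute difference between two azimuths, in [0, 180]."""
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_doa_error(preds, targets) -> float:
    """Average wrapped azimuth error over paired predictions and targets."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, targets)]
    return sum(errors) / len(errors)
```

The wrap-around in `angular_error_deg` matters near the front direction: under these assumptions a prediction of 350 degrees against a target of 10 degrees counts as a 20-degree error rather than 340.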

Related papers