Fugu-MT Paper Translation (Abstract): Sci-Phi: A Large Language Model Spatial Audio Descriptor

Paper summary: Sci-Phi: A Large Language Model Spatial Audio Descriptor

arxiv url: http://arxiv.org/abs/2510.05542v1
Date: Tue, 07 Oct 2025 03:06:02 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.080964
Title: Sci-Phi: A Large Language Model Spatial Audio Descriptor
Title (reference translation): Sci-Phi: A Large Language Model Spatial Audio Descriptor
Authors: Xilin Jiang, Hannes Gamper, Sebastian Braun,
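Abstract summary: Sci-Phi is a spatial audio model with dual spatial and spectral encoders. It enumerates and describes up to four directional sound sources in one pass. It generalizes to real room impulse responses with only minor performance degradation.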
Reference score (independently computed attention score): 25.302416479626974
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo
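Abstract (reference translation): Acoustic scene perception involves describing the types of sounds, their timing, direction, and distance, as well as their loudness and reverberation. Audio language models excel at sound recognition, but single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Trained on over 4,000 hours of synthetic first-order Ambisonics recordings with metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, along with non-directional background sounds and room characteristics. The model is evaluated with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and its robustness is analyzed across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo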
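The abstract mentions a permutation-invariant evaluation protocol but does not spell it out. As a rough illustration of the general idea only, and not the paper's actual protocol, the sketch below matches predicted sources to reference sources by trying every ordering and keeping the one with the lowest total angular error. The function names, the (azimuth, elevation) representation in degrees, and the choice of angular error as the matching cost are all assumptions made for this example. Brute force is workable here because the model describes at most four directional sources.

```python
import itertools
import math


def angular_error_deg(a, b):
    """Great-circle angle in degrees between two (azimuth, elevation) pairs given in degrees."""
    az1, el1 = map(math.radians, a)
    az2, el2 = map(math.radians, b)
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def best_assignment(predicted, reference):
    """Hypothetical helper: find the ordering of predictions that best fits the references.

    Returns (order, mean_error), where predicted[order[i]] is matched to reference[i].
    Assumes both lists have the same length; handling of missed or extra
    sources is out of scope for this sketch.
    """
    if len(predicted) != len(reference):
        raise ValueError("this sketch assumes equal source counts")
    best_order, best_cost = None, math.inf
    for order in itertools.permutations(range(len(predicted))):
        cost = sum(angular_error_deg(predicted[j], reference[i])
                   for i, j in enumerate(order))
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order, best_cost / max(len(reference), 1)
```

For example, with references at azimuths 0, 90, and 180 degrees and predictions listed in a different order, `best_assignment` recovers the matching regardless of the order in which the model emitted its sources. A real protocol would likely combine several cost terms (content, timing, loudness) rather than direction alone.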

Related papers