Fugu-MT paper translation (abstract): SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Paper overview: SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

arxiv url: http://arxiv.org/abs/2604.21190v2
Date: Tue, 28 Apr 2026 07:02:32 GMT
Status: translation complete
Last updated in system: 2026-04-29 14:06:43.797687
Title: SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
Title (reference translation): SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
Authors: Chan Yeong Hwang, Miso Choi, Sunghyun On, Jinkyu Kim, Jungbeom Lee
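Abstract summary: Spatial reasoning requires flexibly coordinating different strategies depending on the input. Most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior. The paper introduces SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases.
Reference score (independently computed attention score): 18.3204772691015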
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that SpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.
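Abstract (reference translation): Understanding a visual scene requires not only recognizing objects but also reasoning about the spatial relationships among them. Unlike general vision-language tasks, spatial reasoning must integrate multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, and the reliability of each varies with the situation. This suggests that effective spatial reasoning needs spatial adaptability, the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches depend on a single reasoning pipeline that implicitly learns a fixed spatial prior, which limits their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but earlier attempts in spatial reasoning mainly used homogeneous agents, which limits the diversity of inductive biases they can exploit. This work introduces SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. It proposes Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on the reliability observed during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, show that SpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.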
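The abstract says that TTO reweights agents by their observed reliability at inference time without changing model parameters, but it gives no update rule. The following is a minimal Python sketch of that general idea only, not the authors' algorithm. It assumes a multiplicative-weights update and uses agreement with the weighted consensus as a stand-in reliability signal. The names `weighted_vote`, `reweight`, `orchestrate`, the `eta` parameter, and the agent callables are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer with the largest total agent weight."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += weights[agent]
    return max(totals, key=totals.get)


def reweight(weights, answers, consensus, eta=0.5):
    """Assumed rule: shrink agents that disagreed with the consensus, then renormalize."""
    updated = {
        agent: w * (1.0 if answers[agent] == consensus else 1.0 - eta)
        for agent, w in weights.items()
    }
    total = sum(updated.values())
    return {agent: w / total for agent, w in updated.items()}


def orchestrate(stream, agents, eta=0.5):
    """Answer each query by weighted vote and adapt agent weights as queries arrive.

    `agents` maps a name to a callable (hypothetical stand-ins for frozen
    vision-language specialists). Only the weights change, never the agents.
    """
    weights = {name: 1.0 / len(agents) for name in agents}
    outputs = []
    for query in stream:
        answers = {name: fn(query) for name, fn in agents.items()}
        consensus = weighted_vote(answers, weights)
        outputs.append(consensus)
        weights = reweight(weights, answers, consensus, eta)
    return outputs, weights
```

In this sketch an agent that repeatedly disagrees with the consensus loses influence over later queries, while the agents themselves stay frozen. Consensus agreement is only one possible reliability proxy; the paper may measure reliability differently.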

Related papers