Fugu-MT Paper Translation (Summary): ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

Paper summary: ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

arxiv url: http://arxiv.org/abs/2605.10106v1
Date: Mon, 11 May 2026 07:20:09 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.59684
Title: ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
Title (reference translation): ViSRA: A Video-Based Spatial Reasoning Agent for Multimodal Large Language Models
Authors: Tingshu Mou, Jiabo He, Renying Wang, Ce Liu, Hao Yang, Tiehua Zhang, Jingjing Chen, Xingjun Ma
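Abstract summary: ViSRA is a training-free framework for probing the spatial reasoning mechanism of MLLMs. It elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models. It offers two advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting, and (2) no post-training computational cost and no heavy manual curation of datasets.
Reference score (independently computed attention score): 38.91282173333918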
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet the progress has been largely driven by post-training on curated benchmarks, leaving the inference-time approach relatively underexplored. In this paper, we take a training-free perspective and introduce ViSRA, a human-aligned Video-based Spatial Reasoning Agent, as a framework to probe the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, enabling a flexible plug-and-play paradigm. ViSRA offers two key advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting; and (2) neither post-training computational cost nor heavy manual curation of spatial reasoning datasets. Experimental results demonstrate consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.
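Abstract (reference translation): Recent Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, but progress has been driven largely by post-training on curated benchmarks, and inference-time approaches remain relatively underexplored. This paper takes a training-free perspective and introduces ViSRA, a human-aligned video-based spatial reasoning agent, as a framework for probing the spatial reasoning mechanism of MLLMs. ViSRA elicits spatial reasoning in a modular and extensible manner by leveraging explicit spatial information from expert models, yielding a flexible plug-and-play paradigm. ViSRA offers two major advantages: (1) human-aligned and transferable 3D understanding rather than task-specific overfitting, and (2) neither post-training computational cost nor heavy manual curation of spatial reasoning datasets. Experiments show consistent improvement across a set of MLLMs on both existing benchmarks and unseen 3D spatial reasoning tasks, with ViSRA outperforming baselines by absolute margins of up to 15.6% and 28.9%, respectively.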

Related papers