Fugu-MT Paper Translation (Abstract): LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Paper overview: LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

arxiv url: http://arxiv.org/abs/2606.05677v1
Date: Thu, 04 Jun 2026 04:00:12 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.540727
Title: LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video
Title (reference translation): LongSpace: Exploring Long-Horizon Spatial Memory in Video, from Perception to Recall
Authors: Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang
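Abstract summary: Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view. The paper introduces LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval.
Reference score (independently computed attention score): 20.1389583507481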
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.
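Abstract (reference translation): Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can handle increasingly long visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, because models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, the authors introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory that covers scene perception, spatial relations, and spatial memory. They also propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding and indicate that explicit spatial memory is a key capability for long-horizon video MLLMs.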
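The abstract describes three ideas: splitting a long video into sequential chunks, writing per-chunk information into a memory, and retrieving from that memory under guidance of the question. The sketch below is only a toy illustration of that general pattern, not the authors' implementation. Everything in it is an assumption made for illustration: the names `split_into_chunks`, `ChunkMemory`, `write`, and `retrieve` are hypothetical, mean pooling and cosine similarity are stand-ins chosen for simplicity, and the 3D structural cues and layer-aware aspects mentioned in the abstract are not modeled at all.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_into_chunks(frames, chunk_size):
    """Split a frame sequence into consecutive, order-preserving chunks."""
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]


def mean_pool(vectors):
    """Average a list of equal-length vectors into one summary vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


@dataclass
class ChunkMemory:
    """Hypothetical memory bank holding one summary vector per video chunk."""
    entries: list = field(default_factory=list)  # (chunk_index, summary_vector)

    def write(self, chunk_index, frame_features):
        # Assumption: a chunk is summarized by mean pooling its frame features.
        self.entries.append((chunk_index, mean_pool(frame_features)))

    def retrieve(self, question_vector, top_k=2):
        """Return indices of the top_k chunks most similar to the question,
        reported in temporal order."""
        ranked = sorted(
            self.entries,
            key=lambda e: cosine(question_vector, e[1]),
            reverse=True,
        )
        return sorted(idx for idx, _ in ranked[:top_k])


# Toy 2-D "features" for six frames; real features would come from a vision encoder.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7], [0.6, 0.8]]

memory = ChunkMemory()
for i, chunk in enumerate(split_into_chunks(frames, chunk_size=2)):
    memory.write(i, chunk)

question = [0.0, 1.0]  # toy question embedding
selected = memory.retrieve(question, top_k=2)
print(selected)
```

In this toy setup the question vector points along the second feature axis, so the chunks whose pooled features lean that way are selected. A real system would pass the retrieved chunk representations to the language model rather than just returning indices.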

Related papers