Fugu-MT Paper Translation (Abstract): Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Paper abstract: Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

arxiv url: http://arxiv.org/abs/2603.12255v1
Date: Thu, 12 Mar 2026 17:58:58 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.292681
Title: Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
Title (reference translation): Spatial-TTT: Streaming Visual Spatial Intelligence with Test-Time Training
Authors: Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan
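Abstract summary: We propose Spatial-TTT toward streaming visual-based spatial intelligence with test-time training (TTT). We design a hybrid architecture and apply large-chunk updates in parallel with sliding-window attention for efficient spatial video processing. Experiments show that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks.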
Reference score (independently computed attention metric): 61.6942259866261
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution, which encourages the model to capture geometric correspondence and temporal continuity across frames. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments demonstrate that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.
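Abstract (reference translation): Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to maintain and update spatial evidence in a streaming fashion from potentially unbounded video streams is essential for spatial intelligence. The main challenge is not simply lengthening the context window but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT toward streaming visual-based spatial intelligence with test-time training (TTT). Specifically, we design a hybrid architecture and apply large-chunk updates in parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D spatiotemporal convolution. Beyond architecture design, we construct a dataset with dense 3D spatial descriptions, which guides the model to update its fast weights to memorize and organize global 3D spatial signals in a structured manner. Extensive experiments show that Spatial-TTT improves long-horizon spatial understanding and achieves state-of-the-art performance on video spatial benchmarks. Project page: https://liuff19.github.io/Spatial-TTT.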
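The abstract describes adapting a subset of parameters (fast weights) at test time, with one update per large chunk of the stream. The following is a minimal sketch of that general idea only, not the authors' implementation. It assumes a linear fast-weight memory `W` trained with a squared reconstruction loss from keys to values; that loss, the function names, and the learning rate are all hypothetical choices made for illustration. Sliding-window attention, the spatial-predictive mechanism, and the 3D spatiotemporal convolution are not modeled here.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def chunk_loss(W, keys, values):
    """Mean squared error of reconstructing each value from its key via W."""
    total = 0.0
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        total += sum((p - t) ** 2 for p, t in zip(pred, v))
    return total / len(keys)


def ttt_chunk_update(W, keys, values, lr):
    """Take one gradient step on chunk_loss and return the new fast weights."""
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        for i in range(d_out):
            # Derivative of the mean squared error w.r.t. the i-th output.
            err = 2.0 * (pred[i] - v[i]) / len(keys)
            for j in range(d_in):
                grad[i][j] += err * k[j]
    return [[W[i][j] - lr * grad[i][j] for j in range(d_in)]
            for i in range(d_out)]


def stream_fast_weights(tokens, chunk_size, lr=0.1):
    """Process (key, value) pairs chunk by chunk, updating W once per chunk.

    Returns the final fast weights and the loss measured on each chunk
    before that chunk's update was applied.
    """
    d_in = len(tokens[0][0])
    d_out = len(tokens[0][1])
    W = [[0.0] * d_in for _ in range(d_out)]  # fast weights start at zero
    losses = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        keys = [k for k, _ in chunk]
        values = [v for _, v in chunk]
        losses.append(chunk_loss(W, keys, values))
        W = ttt_chunk_update(W, keys, values, lr)
    return W, losses


if __name__ == "__main__":
    # Toy stream: the same two (key, value) pairs repeated five times.
    pairs = [([1.0, 0.0], [2.0, 0.0]), ([0.0, 1.0], [0.0, 2.0])]
    W, losses = stream_fast_weights(pairs * 5, chunk_size=2)
    print(losses)
```

Updating once per chunk rather than once per token keeps the number of sequential weight updates low, which is the efficiency argument behind large-chunk updates. In this toy stream the per-chunk loss shrinks as the fast weights absorb the repeated pattern.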
