SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

Authors: Haosong Peng, Hao Li, Jiaqi Chen, Yuhao Pan, Runmao Yao, Yalun Dai, Fushuo Huo, Fangzhou Hong, Zhaoxi Chen, Haozhao Wang, Dingwen Zhang, Ziwei Liu, Wenchao Xu
Affiliations: Northwestern Polytechnical University / Nanyang Technological University / Southeast University / The Hong Kong University of Science and Technology / Huazhong University of Science and Technology
Categories: Evaluation / Domain Adaptation Evaluation / Cross-paradigm spatial model testing; Evaluation / Benchmarking / Multi-task spatial benchmark; Method / Data Quality Assessment / Impact of domain alignment on performance
License: CC BY 4.0

Abstract Overview

SpatialBench is a reproducible benchmark for evaluating spatial foundation models across paradigms, domains, tasks, and input densities under a deterministic sampling protocol. The benchmark aggregates 19 datasets and 546 scenes across five spatial domains, and evaluates 41 model variants from six paradigms on five task suites under four input-density regimes. The study is designed to compare models fairly across settings such as monocular, sparse-view, medium-overlap, and dense long-sequence inputs, while exposing robustness to domain shift and hardware-related constraints. Across this broad evaluation, the authors conclude that current spatial foundation models are not yet reliable all-round performers, especially under embodied-view domain shifts and long-horizon constraints. To investigate one of the largest observed data gaps, they also introduce the DA-Next-5M dataset and a DA-Next baseline focused on egocentric and wrist-view settings.
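
As a rough illustration of what a deterministic sampling protocol can look like, the Python sketch below selects a reproducible subset of frame indices for each scene and input-density regime. The regime names loosely follow the settings listed above, but the frame budgets in REGIME_FRAMES, the function name sample_frames, and the hash-based seeding are all assumptions made for illustration; this summary does not describe the benchmark's actual procedure.

```python
import hashlib
import random

# Hypothetical frame budgets per regime; the source does not state these numbers.
REGIME_FRAMES = {"monocular": 1, "sparse": 4, "medium": 16, "dense": 64}


def sample_frames(scene_id: str, num_frames: int, regime: str) -> list[int]:
    """Return a sorted, reproducible list of frame indices for one scene."""
    # Never ask for more frames than the scene contains.
    k = min(REGIME_FRAMES[regime], num_frames)
    # Seed from the scene id and regime via SHA-256 so the selection does not
    # depend on process state (Python's built-in hash() is salted per process).
    digest = hashlib.sha256(f"{scene_id}:{regime}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sample without replacement, then sort to keep temporal order.
    return sorted(rng.sample(range(num_frames), k))
```

Because the seed depends only on the scene identifier and the regime, every model evaluated under this sketch would receive the same frames, which is the property that makes cross-model comparison fair.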

Novelty

The paper's main novelty is a standardized, cross-paradigm benchmark that jointly varies scene domain, task type, and deterministic input density for spatial foundation models, enabling direct comparison across six model families under a unified protocol. It also extends beyond benchmarking by introducing a large-scale egocentric and wrist-view dataset, DA-Next-5M, together with a domain-targeted baseline model to probe the identified out-of-distribution gap.

Results

The evaluation finds that full-context feed-forward models define the accuracy upper bound when memory is available, while bounded-memory approaches scale better to long sequences but generally sacrifice geometric accuracy. It also shows that data quality and strict domain alignment matter more than simply increasing training data volume, with egocentric and wrist-view domains emerging as the strongest out-of-distribution failure modes. In those embodied-view settings, DA-Next improves depth AbsRel (absolute relative error, lower is better) over DA3-Giant by 47% and 59% on sparse and medium inputs, respectively, and raises pose AUC@30 by 3.1% and 5.5%, supporting the value of targeted in-domain data curation.
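
For readers unfamiliar with the two metrics quoted above, the following Python sketch shows one common way to compute them. It is a sketch under stated assumptions, not the benchmark's evaluation code: the function names are hypothetical, the AUC uses a simple average of accuracies at integer degree thresholds (other definitions exist), and reading the 47% and 59% figures as relative error reductions is an assumption that this summary does not confirm.

```python
import numpy as np


def abs_rel(pred, gt):
    """Mean absolute relative depth error over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # assumption: non-positive ground-truth depth marks invalid pixels
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


def relative_improvement(baseline_err, new_err):
    """Fractional error reduction relative to a baseline (lower error is better)."""
    return (baseline_err - new_err) / baseline_err


def pose_auc(errors_deg, max_threshold=30):
    """Average accuracy over integer thresholds 1..max_threshold degrees.

    errors_deg holds one angular pose error per image pair, in degrees.
    """
    errors = np.asarray(errors_deg, dtype=np.float64)
    thresholds = np.arange(1, max_threshold + 1)
    accuracies = [(errors < t).mean() for t in thresholds]
    return float(np.mean(accuracies))
```

Under this reading, a baseline AbsRel of 0.100 dropping to 0.053 would correspond to a relative improvement of about 47%; these two error values are made up for the arithmetic and are not taken from the paper.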

Key Points

  1. SpatialBench evaluates 41 model variants from six spatial-model paradigms across 19 datasets, 546 scenes, five task suites, and four deterministic input-density regimes.
  2. The benchmark analysis shows a clear trade-off: full-context attention yields the highest accuracy when the input length is bounded, whereas streaming, chunked, SLAM, and test-time training methods handle long sequences better under limited memory.
  3. The most severe generalization failures occur in egocentric and wrist-view domains, and the proposed DA-Next-5M dataset plus DA-Next baseline demonstrate that targeted in-domain data can substantially reduce this gap.
