Fast Spatial Memory with Elastic Test-Time Training
Overview
This paper proposes Elastic Test-Time Training (LaCET), an extension of Large Chunk Test-Time Training (LaCT) that stabilizes inference-time fast-weight updates for long-context 3D and 4D reconstruction. LaCET introduces a Fisher-weighted elastic prior around dynamically maintained anchor weights; the anchors evolve via a streaming exponential moving average, balancing plasticity against stability across chunks and addressing the catastrophic forgetting and overfitting seen in fully plastic LaCT. Building on LaCET, the authors introduce Fast Spatial Memory (FSM), a scalable model that learns spatiotemporal representations from long sequences of posed images and renders novel views at novel times. Two decoder variants are presented, LVSM-style (direct view synthesis) and LRM-style (explicit Gaussian-based), and both are evaluated through ablations on Stereo4D and benchmark experiments on 3D (DL3DV) and 4D (Stereo4D, NVIDIA) novel view synthesis tasks.
Novelty
The main novelty is reformulating test-time training for long-sequence reconstruction as an elastic fast-weight process: chunk-wise adaptation is regularized by Fisher-style importance estimates (with EWC, MAS, and SI variants) and by dynamically maintained streaming-EMA anchors, drawing on elastic weight consolidation from continual learning. The work also presents FSM as, per the authors, the first large-scale 4D reconstruction model designed to accept long sequences of posed images with arbitrary timestamps and render novel view-time combinations using this LaCET mechanism.
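To make the elastic fast-weight idea concrete, here is a minimal sketch of one chunk-wise update under a quadratic EWC-style penalty pulling the fast weights toward the anchor. All names (`elastic_update`, `fisher`, `lam`, `lr`) are illustrative assumptions, not the paper's actual API, and the real model updates full network weights rather than a flat vector.

```python
import numpy as np

def elastic_update(w, w_anchor, fisher, grad_task, lam=0.1, lr=1e-2):
    """One gradient step on: task_loss + (lam/2) * sum_i fisher_i * (w_i - anchor_i)^2.

    Weights with high Fisher-style importance are held close to the anchor
    (stability); low-importance weights remain free to adapt (plasticity).
    """
    grad_penalty = lam * fisher * (w - w_anchor)  # gradient of the elastic prior
    return w - lr * (grad_task + grad_penalty)

# Toy illustration: with no task gradient, only the elastic prior acts.
w = np.array([1.0, 1.0])
anchor = np.array([0.0, 0.0])
fisher = np.array([1.0, 0.0])   # only the first weight is "important"
grad = np.zeros(2)
w_new = elastic_update(w, anchor, fisher, grad, lam=1.0, lr=0.5)
# w_new[0] is pulled toward the anchor; w_new[1] is left untouched.
```

The same step with a nonzero task gradient recovers ordinary LaCT adaptation plus the consolidation pull, which is the trade-off the ablations probe.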
Results
In ablations on Stereo4D, elastic training with streaming-EMA anchors substantially improves the 4-chunk setting over vanilla LaCT: the streaming-EMA + MAS variant reaches 29.928 PSNR, 0.0519 LPIPS, and 0.898 SSIM, versus 26.908, 0.0988, and 0.814 for non-elastic 4-chunk LaCT (higher PSNR/SSIM and lower LPIPS are better). Analysis shows reduced overfitting and less reliance on camera-interpolation shortcuts under sparse-input settings. On benchmark evaluations, FSM-LVSM achieves 32.16 PSNR on Stereo4D and 23.90 PSNR on NVIDIA at 256×256 resolution, outperforming prior feed-forward methods, while remaining competitive on the static-scene DL3DV benchmark (26.69 PSNR at 256×256).
Key Points
- Elastic Test-Time Training (LaCET) adds a consolidation step after chunk-wise fast-weight updates, using Fisher-style importance estimates (EWC, MAS, or SI variants) and streaming-EMA anchor weights to limit fast-weight drift and mitigate catastrophic forgetting at inference time.
- The proposed Fast Spatial Memory (FSM) model uses LaCET blocks to learn scene-level spatiotemporal representations from long posed-image sequences, supporting both LVSM-style direct view synthesis and LRM-style Gaussian splatting rendering for arbitrary novel view-time queries.
- Ablation experiments demonstrate that elastic multi-chunk adaptation with streaming-EMA anchors substantially improves generalization and reconstruction quality over fully plastic LaCT, particularly reducing overfitting and camera-interpolation shortcuts in long-sequence and dynamic-scene settings.
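The two ingredients the key points lean on, a streaming-EMA anchor and a MAS-style importance estimate, can be sketched as follows. These are assumed forms based on the standard EMA and MAS definitions, not the paper's code; `beta` and the helper names are hypothetical.

```python
import numpy as np

def update_anchor(w_anchor, w_fast, beta=0.9):
    # Streaming EMA: the anchor slowly tracks the adapted fast weights,
    # trading plasticity (low beta) against stability (high beta).
    return beta * w_anchor + (1.0 - beta) * w_fast

def mas_importance(per_sample_grads):
    # MAS-style importance: mean absolute gradient (per weight) of the
    # squared output norm, accumulated over a chunk's samples.
    return np.mean(np.abs(per_sample_grads), axis=0)

anchor = update_anchor(np.zeros(3), np.ones(3), beta=0.9)    # each entry moves 10% toward 1
omega = mas_importance(np.array([[1.0, -2.0], [3.0, 0.0]]))  # per-weight mean |grad|
```

After each chunk's fast-weight update, the anchor is refreshed with `update_anchor` and the importance estimate with `mas_importance`, so the elastic prior for the next chunk is centered on recent, consolidated weights rather than the initial slow weights.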
References
- arXiv: https://arxiv.org/abs/2604.07350v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.07350v1
- Hugging Face Papers: https://huggingface.co/papers/2604.07350
- Project: https://fast-spatial-memory.github.io/