FuguReport

Fast Spatial Memory with Elastic Test-Time Training

Authors Ziqiao Ma, Xueyang Yu, Haoyu Zhen, Yuncong Yang, Joyce Chai, Chuang Gan
Affiliations MIT-IBM Watson AI Lab / University of Massachusetts Amherst / University of Michigan
Categories Method / Test-Time Training / Elastic weight regularization technique, Task / 3D Reconstruction / Long-term spatial reconstruction, Evaluation / Memory Mechanisms / Performance in spatial memory usage
License CC BY 4.0

Abstract Overview

This paper proposes Elastic Test-Time Training (LaCET), an extension of Large Chunk Test-Time Training (LaCT) that stabilizes inference-time fast-weight updates for long-context 3D and 4D reconstruction. LaCET introduces a Fisher-weighted elastic prior around dynamically maintained anchor weights; the anchors evolve via a streaming exponential moving average (EMA) to balance plasticity and stability across chunks, addressing the catastrophic forgetting and overfitting of fully plastic LaCT. Building on LaCET, the authors introduce Fast Spatial Memory (FSM), a scalable model that learns spatiotemporal representations from long sequences of posed images and renders novel views at novel times. The paper presents both LVSM-style (direct view synthesis) and LRM-style (explicit Gaussian-based) decoder variants, and evaluates them through ablations on Stereo4D and benchmark experiments on 3D (DL3DV) and 4D (Stereo4D, NVIDIA) novel view synthesis tasks.
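The chunk-wise objective described above can be sketched in equation form. The notation here is illustrative, not taken from the paper: $W$ denotes the fast weights, $A$ the streaming-EMA anchor weights, $F_i$ the Fisher-style importance estimates, and $\lambda$, $\beta$ assumed hyperparameters.

```latex
% Illustrative sketch of the elastic fast-weight objective:
% per-chunk loss plus a Fisher-weighted quadratic pull toward the anchor.
\mathcal{L}_t(W) \;=\; \mathcal{L}_{\text{chunk}}(W)
  \;+\; \frac{\lambda}{2} \sum_i F_i \,\bigl(W_i - A_i\bigr)^2,
\qquad
A \;\leftarrow\; \beta A + (1 - \beta)\, W_t
```

The quadratic term plays the role of the elastic prior: directions the importance estimates mark as critical for earlier chunks resist drift, while unimportant directions remain plastic.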

Novelty

The main novelty is reformulating test-time training for long-sequence reconstruction as an elastic fast-weight process, in which chunk-wise adaptation is regularized by Fisher-style importance estimates (with EWC, MAS, and SI variants) and dynamically maintained streaming-EMA anchors, drawing on elastic weight consolidation from continual learning. The work also introduces FSM as the first large-scale 4D reconstruction model designed to accept long sequences of posed images with arbitrary timestamps and render novel view-time combinations via the LaCET mechanism.

Results

In ablations on Stereo4D, elastic training with streaming-EMA anchors substantially improves the 4-chunk setting over vanilla LaCT: the streaming-EMA + MAS variant reaches 29.928 PSNR, 0.0519 LPIPS, and 0.898 SSIM, versus 26.908, 0.0988, and 0.814 for the non-elastic 4-chunk LaCT baseline (higher PSNR and SSIM and lower LPIPS are better). Analysis shows reduced overfitting and less reliance on camera-interpolation shortcuts under sparse-input settings. On benchmark evaluations, FSM-LVSM achieves 32.16 PSNR on Stereo4D and 23.90 PSNR on NVIDIA at 256×256 resolution, outperforming prior feed-forward methods, while remaining competitive on the static-scene DL3DV benchmark (26.69 PSNR at 256×256).

Key Points

  1. Elastic Test-Time Training (LaCET) adds a consolidation step after chunk-wise fast-weight updates, using Fisher-style importance estimates (EWC, MAS, or SI variants) and streaming-EMA anchor weights to limit fast-weight drift and mitigate catastrophic forgetting at inference time.
  2. The proposed Fast Spatial Memory (FSM) model uses LaCET blocks to learn scene-level spatiotemporal representations from long posed-image sequences, supporting both LVSM-style direct view synthesis and LRM-style Gaussian splatting rendering for arbitrary novel view-time queries.
  3. Ablation experiments demonstrate that elastic multi-chunk adaptation with streaming-EMA anchors substantially improves generalization and reconstruction quality over fully plastic LaCT, particularly reducing overfitting and camera-interpolation shortcuts in long-sequence and dynamic-scene settings.
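The consolidation loop in point 1 can be illustrated with a toy NumPy sketch. This is a minimal stand-in, not the paper's implementation: the fast-weight layer is reduced to a linear map, the chunk sizes, learning rate, and loop counts are arbitrary, and the importance estimate is a simple MAS-style squared-gradient EMA.

```python
import numpy as np

rng = np.random.default_rng(0)

def chunk_grad(W, X, Y):
    """Gradient of the mean-squared chunk loss for a linear fast-weight map Y ~ X @ W."""
    return X.T @ (X @ W - Y) / len(X)

def elastic_step(W, anchor, fisher, X, Y, lr=0.5, lam=1.0):
    """One fast-weight update: chunk gradient plus a Fisher-weighted pull toward the anchor."""
    g = chunk_grad(W, X, Y)
    return W - lr * (g + lam * fisher * (W - anchor)), g

def consolidate(W, anchor, fisher, g, beta=0.9):
    """Post-chunk consolidation: streaming-EMA anchor and MAS-style importance
    (EMA of the squared gradient magnitude)."""
    return beta * anchor + (1 - beta) * W, beta * fisher + (1 - beta) * g ** 2

# Toy stream: every chunk is drawn from one underlying mapping W_true (a stable "scene").
W_true = rng.normal(size=(4, 3))
W = np.zeros((4, 3))
anchor = np.zeros((4, 3))
fisher = np.zeros((4, 3))

for _ in range(8):                      # 8 chunks of the stream
    X = rng.normal(size=(32, 4))
    Y = X @ W_true
    for _ in range(5):                  # a few inner fast-weight steps per chunk
        W, g = elastic_step(W, anchor, fisher, X, Y)
    anchor, fisher = consolidate(W, anchor, fisher, g)

final_err = np.mean((W - W_true) ** 2)
```

The key design point mirrored here is the two-phase structure: the inner loop stays plastic within a chunk, while the post-chunk consolidation both moves the anchor toward the current fast weights and accumulates importance, so later chunks are penalized for drifting along directions that mattered earlier.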

References

This page was created using generative AI models such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successors. No guarantee can be made regarding its contents.