Tempered Self-Similarity Alignment for Physically Plausible Video Generation
Abstract Overview
This paper proposes Tempered Self-Similarity Alignment (TSA) to improve the physical plausibility of generated videos by transferring spatio-temporal relational structure from a pretrained visual foundation model into a video diffusion model. Rather than aligning raw spatio-temporal self-similarity tensors, the method converts them into temperature-scaled correspondence probability distributions and aligns the diffusion model to the foundation model with a KL-divergence objective. The authors also introduce a masked variant, M-TSA, that restricts alignment to motion-salient dynamic regions so supervision focuses on physically meaningful motion rather than static background areas. Experiments on VideoPhy, VideoPhy2, and VBench indicate that this alignment improves physical consistency while largely preserving overall video quality.
Novelty
The distinctive idea is to reinterpret spatio-temporal self-similarity as probabilistic spatio-temporal correspondence distributions and to sharpen them with temperature scaling before alignment, instead of directly matching raw self-similarity tensors. The work also adds motion-focused masking so the transfer of relational knowledge is concentrated on dynamic regions where physical interactions occur.
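The idea of turning self-similarity into correspondence distributions and aligning them with KL divergence can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. It assumes that both models expose per-token features on the same flattened spatio-temporal grid of N tokens, that similarity is cosine similarity, that the KL direction is teacher to student, and that the temperature is 0.1. The function names `correspondence_distribution` and `tsa_loss` are hypothetical.

```python
import numpy as np

def correspondence_distribution(features, tau):
    """Turn (N, D) token features into (N, N) row-stochastic correspondences.

    Assumption: cosine self-similarity followed by a temperature-scaled softmax.
    The self-match on the diagonal is kept here for simplicity.
    """
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / (norms + 1e-8)
    sim = unit @ unit.T                                # raw self-similarity, (N, N)
    logits = sim / tau                                 # smaller tau -> sharper rows
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

def tsa_loss(student_feats, teacher_feats, tau=0.1, eps=1e-8):
    """Mean over tokens of KL(teacher || student); the direction is an assumption."""
    p_teacher = correspondence_distribution(teacher_feats, tau)
    p_student = correspondence_distribution(student_feats, tau)
    kl_per_token = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)), axis=-1
    )
    return float(kl_per_token.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens = 2 * 4 * 4                               # T * H * W, flattened
    student = rng.normal(size=(n_tokens, 16))
    teacher = rng.normal(size=(n_tokens, 24))          # feature widths may differ
    print("loss:", tsa_loss(student, teacher))
```

Because each model's features are compared only with themselves, the two feature widths do not need to match; only the token grid does. Lowering `tau` concentrates each row on its strongest matches, which is the sharpening effect the paragraph describes.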
Results
On VideoPhy, M-TSA raises the overall Physical Commonsense (PC) score of CogVideoX-2B* from 25.3 to 30.8, while TSA reaches 29.1. The largest reported category gain is in solid-solid interactions, where PC rises from 15.4 to 21.0 with M-TSA. On VideoPhy2, the Joint score increases from 22.9 for CogVideoX-2B* to 24.4 with M-TSA. On VBench, the Total Score moves only slightly, from 81.0 to 81.2, suggesting the method improves physical plausibility without materially degrading general video quality.
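The size of each reported change is easier to compare once the absolute and relative gains are computed side by side. The short sketch below uses only the baseline and M-TSA scores quoted in this section; the dictionary keys and the `gains` helper are labels chosen for illustration.

```python
# (baseline, M-TSA) score pairs quoted in the Results paragraph.
scores = {
    "VideoPhy PC (overall)": (25.3, 30.8),
    "VideoPhy PC (solid-solid)": (15.4, 21.0),
    "VideoPhy2 Joint": (22.9, 24.4),
    "VBench Total": (81.0, 81.2),
}

def gains(pairs):
    """Return {name: (absolute gain in points, relative gain in percent)}."""
    out = {}
    for name, (base, new) in pairs.items():
        absolute = round(new - base, 1)
        relative = round(100.0 * (new - base) / base, 1)
        out[name] = (absolute, relative)
    return out

if __name__ == "__main__":
    for name, (abs_gain, rel_gain) in gains(scores).items():
        print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

The absolute gains are 5.5, 5.6, 1.5, and 0.2 points respectively, so the change in VBench Total Score is an order of magnitude smaller than the change in the VideoPhy physical commonsense scores.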
Key Points
- TSA aligns temperature-scaled spatio-temporal correspondence distributions derived from self-similarity, providing more targeted motion supervision than directly aligning raw spatio-temporal self-similarity (STSS) tensors.
- The masked version, M-TSA, selects dynamic regions using temporal-difference-based motion saliency and excludes static regions from the alignment loss.
- Across reported benchmarks, the method improves physical plausibility metrics over the CogVideoX-2B baseline and prior alignment baselines while maintaining comparable overall VBench quality.
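The masking step described for M-TSA can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's procedure: motion saliency is taken as the norm of the frame-to-frame feature difference, the dynamic region is the top `keep_ratio` fraction of tokens by saliency, and the mask is applied to the per-token alignment loss by averaging over selected tokens only. The names `motion_mask`, `masked_mean`, and `keep_ratio` are hypothetical.

```python
import numpy as np

def motion_mask(feats, keep_ratio=0.3):
    """Boolean (T, H, W) mask of motion-salient tokens from (T, H, W, D) features.

    Assumption: saliency is the L2 norm of the temporal feature difference,
    and the most salient `keep_ratio` fraction of tokens counts as dynamic.
    """
    diff = np.linalg.norm(feats[1:] - feats[:-1], axis=-1)   # (T-1, H, W)
    saliency = np.zeros(feats.shape[:3])
    saliency[1:] = diff
    saliency[0] = diff[0]              # first frame reuses its forward difference
    k = max(1, int(round(keep_ratio * saliency.size)))
    threshold = np.sort(saliency.ravel())[-k]
    return saliency >= threshold       # ties at the threshold are all kept

def masked_mean(per_token_loss, mask):
    """Average a flattened per-token loss over masked (dynamic) tokens only."""
    weights = mask.ravel().astype(per_token_loss.dtype)
    return float((per_token_loss * weights).sum() / max(weights.sum(), 1.0))

if __name__ == "__main__":
    T, H, W, D = 4, 2, 2, 3
    feats = np.zeros((T, H, W, D))
    for t in range(T):
        feats[t, 0, 0, :] = t          # only one spatial location changes over time
    mask = motion_mask(feats, keep_ratio=0.25)
    per_token_loss = np.arange(T * H * W, dtype=float)
    print(mask[:, 0, 0], masked_mean(per_token_loss, mask))
```

In this toy input only the token at spatial position (0, 0) moves, so the mask selects that position in every frame and the static tokens contribute nothing to the averaged loss. In a full pipeline, `per_token_loss` would be the per-token KL term of the alignment objective.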