FuguReport

Tempered Self-Similarity Alignment for Physically Plausible Video Generation

Authors: Manjin Kim, Suha Kwak, Minsu Cho
Affiliations: Pohang University of Science and Technology
Categories: Method / Video Generation / Dynamic region correspondence training; Evaluation / Physical Plausibility / Assessing physical consistency in video; Application / Interaction Modeling / Relation knowledge transfer in video
License: CC BY 4.0

Abstract Overview

This paper proposes Tempered Self-Similarity Alignment (TSA) to improve the physical plausibility of generated videos by transferring spatio-temporal relational structure from a pretrained visual foundation model into a video diffusion model. Rather than aligning raw spatio-temporal self-similarity tensors, the method converts them into temperature-scaled correspondence probability distributions and aligns the diffusion model to the foundation model with a KL-divergence objective. The authors also introduce a masked variant, M-TSA, that restricts alignment to motion-salient dynamic regions so supervision focuses on physically meaningful motion rather than static background areas. Experiments on VideoPhy, VideoPhy2, and VBench indicate that this alignment improves physical consistency while largely preserving overall video quality.
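The alignment objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the default temperature of 0.1, and the KL direction (teacher relative to student) are all assumptions made for illustration. Each token is a feature vector, with tokens flattened over space and time. Because only token-to-token similarities are compared, the teacher and student feature dimensions do not need to match.

```python
import math

def cosine_self_similarity(feats):
    """Pairwise cosine similarity between all tokens (flattened over space and time)."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in feats]
    unit = [[x / n for x in f] for f, n in zip(feats, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def softmax(row, tau):
    """Temperature-scaled softmax; a smaller tau gives a sharper distribution."""
    scaled = [s / tau for s in row]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def tsa_loss(teacher_feats, student_feats, tau=0.1, eps=1e-12):
    """Mean over query tokens of KL(teacher || student) between
    temperature-scaled correspondence distributions (assumed direction)."""
    sim_t = cosine_self_similarity(teacher_feats)
    sim_s = cosine_self_similarity(student_feats)
    total = 0.0
    for row_t, row_s in zip(sim_t, sim_s):
        p = softmax(row_t, tau)  # foundation-model correspondence distribution
        q = softmax(row_s, tau)  # diffusion-model correspondence distribution
        total += sum(pi * (math.log(pi + eps) - math.log(qi + eps))
                     for pi, qi in zip(p, q))
    return total / len(sim_t)
```

The loss is zero when both models induce the same correspondence distributions and grows as the student's relational structure drifts from the teacher's.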

Novelty

The distinctive idea is to reinterpret spatio-temporal self-similarity as probabilistic spatio-temporal correspondence distributions and to sharpen them with temperature scaling before alignment, instead of directly matching raw self-similarity tensors. The work also adds motion-focused masking so the transfer of relational knowledge is concentrated on dynamic regions where physical interactions occur.
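To see why temperature scaling sharpens a correspondence distribution, the following small sketch applies a softmax at several temperatures to one query token's similarity scores and reports the entropy. The similarity values and the temperatures are hypothetical and are not taken from the paper.

```python
import math

def softmax(scores, tau):
    """Temperature-scaled softmax over one query token's similarity scores."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

# Hypothetical cosine similarities between one query token and four candidates.
similarities = [0.9, 0.7, 0.2, 0.1]

for tau in (1.0, 0.1, 0.05):
    probs = softmax(similarities, tau)
    print(tau, [round(x, 3) for x in probs], round(entropy(probs), 3))
```

At a temperature of 1.0 the distribution stays close to uniform, so most of the alignment signal is spread over weak matches. Lower temperatures concentrate probability mass on the strongest correspondences.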

Results

On VideoPhy, M-TSA raises the overall Physical Commonsense (PC) score of CogVideoX-2B* from 25.3 to 30.8, while TSA reaches 29.1. The largest reported category gain is in solid-solid interactions, where PC rises from 15.4 to 21.0 with M-TSA. On VideoPhy2, the Joint score increases from 22.9 for CogVideoX-2B* to 24.4 with M-TSA. On VBench, the Total Score moves from 81.0 to 81.2, suggesting the method improves physical plausibility without materially degrading general video quality.
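The size of these gains is easier to compare as absolute and relative differences. The short script below derives both from the figures quoted above; the dictionary labels and the helper name `gains` are illustrative, and no numbers beyond those reported are introduced.

```python
# (baseline, with method) pairs as quoted in the Results text.
reported = {
    "VideoPhy PC overall (M-TSA)": (25.3, 30.8),
    "VideoPhy PC overall (TSA)": (25.3, 29.1),
    "VideoPhy PC solid-solid (M-TSA)": (15.4, 21.0),
    "VideoPhy2 Joint (M-TSA)": (22.9, 24.4),
    "VBench Total Score": (81.0, 81.2),
}

def gains(before, after):
    """Return (absolute gain in points, relative gain in percent), rounded to 1 decimal."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)

for name, (before, after) in reported.items():
    abs_gain, rel_gain = gains(before, after)
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

The physical-plausibility gains on VideoPhy (3.8 to 5.6 points) are much larger than the 0.2-point change in the VBench Total Score, which is consistent with the claim that general quality is largely unchanged.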

Key Points

  1. TSA aligns temperature-scaled spatio-temporal correspondence distributions derived from self-similarity, providing more targeted motion supervision than directly aligning raw spatio-temporal self-similarity (STSS) tensors.
  2. The masked version, M-TSA, selects dynamic regions using temporal-difference-based motion saliency and excludes static regions from the alignment loss.
  3. Across the reported benchmarks, the method improves physical plausibility metrics over the CogVideoX-2B* baseline and prior alignment baselines while keeping overall VBench quality comparable.
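The masking step in the second point can be sketched as follows. This is a hedged illustration only. The summary does not state which signal the temporal differences are computed on or how dynamic tokens are selected, so the feature-space difference, the `keep_ratio` parameter, and all function names here are assumptions.

```python
import math

def motion_saliency(frames):
    """frames[t][n] is the feature vector of spatial token n at frame t.
    Returns saliency[t][n]: L2 norm of the difference to an adjacent frame."""
    num_frames = len(frames)
    saliency = []
    for t in range(num_frames):
        # Compare with the next frame; the last frame looks backward instead.
        other = t + 1 if t + 1 < num_frames else t - 1
        saliency.append([
            math.sqrt(sum((a - b) ** 2 for a, b in zip(cur, adj)))
            for cur, adj in zip(frames[t], frames[other])
        ])
    return saliency

def dynamic_mask(saliency, keep_ratio=0.5):
    """Flat boolean mask over all T*N tokens keeping the most motion-salient ones.
    Ties at the threshold are all kept, so slightly more than keep_ratio may survive."""
    flat = [s for frame in saliency for s in frame]
    k = max(1, int(round(keep_ratio * len(flat))))
    threshold = sorted(flat, reverse=True)[k - 1]
    return [s >= threshold for s in flat]

def masked_mean(per_token_loss, mask):
    """Average a per-token alignment loss over dynamic tokens only."""
    kept = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    return sum(kept) / len(kept)

# Two frames, three spatial tokens; only the middle token changes between frames.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]],
]
mask = dynamic_mask(motion_saliency(frames), keep_ratio=1 / 3)
loss = masked_mean([0.1, 0.9, 0.2, 0.1, 0.7, 0.3], mask)
```

In this toy example only the moving token contributes to the loss in each frame, so static background tokens no longer dilute the motion supervision.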
