NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
- URL: http://arxiv.org/abs/2504.11427v1
- Date: Tue, 15 Apr 2025 17:39:07 GMT
- Title: NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
- Authors: Yanrui Bin, Wenbo Hu, Haoyuan Wang, Xinya Chen, Bing Wang
- Abstract summary: We present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization. We also introduce a two-stage training protocol to preserve spatial accuracy while maintaining long temporal context.
- Score: 6.7253553166914255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing superior performance in generating temporally consistent normal sequences with intricate details from diverse videos.
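The abstract does not detail how Semantic Feature Regularization is computed. Below is a minimal sketch of what such a feature-alignment loss could look like, assuming the diffusion features come from an intermediate layer of the denoising UNet and the semantic cues from a frozen image encoder such as DINO; the encoder choice, the learnable 1x1 projection, and all tensor shapes are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def semantic_feature_regularization(diff_feats: torch.Tensor,
                                    sem_feats: torch.Tensor,
                                    proj: torch.nn.Module) -> torch.Tensor:
    """One plausible SFR-style loss: align diffusion features with
    features from a frozen semantic encoder via cosine similarity.

    diff_feats: (B, C_d, H_d, W_d) intermediate denoising-UNet features.
    sem_feats:  (B, C_s, H_s, W_s) frozen semantic-encoder features.
    proj:       learnable projection mapping C_d -> C_s channels.
    """
    d = proj(diff_feats)                        # project into semantic space
    d = F.interpolate(d, size=sem_feats.shape[-2:],
                      mode="bilinear", align_corners=False)
    d = F.normalize(d, dim=1)
    s = F.normalize(sem_feats.detach(), dim=1)  # semantic encoder stays frozen
    # 1 - cosine similarity per spatial location, averaged over the batch
    return (1.0 - (d * s).sum(dim=1)).mean()

# Usage with random tensors standing in for real features;
# the channel dimensions (320, 768) are placeholders.
proj = torch.nn.Conv2d(320, 768, kernel_size=1)
unet_feats = torch.randn(2, 320, 32, 32, requires_grad=True)
dino_feats = torch.randn(2, 768, 16, 16)
loss = semantic_feature_regularization(unet_feats, dino_feats, proj)
loss.backward()
```

In the paper's two-stage protocol, a term like this would presumably be added to the usual diffusion training objective; how it is weighted, and at which UNet layers the features are tapped, are not specified in the abstract.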
Related papers
- STeP: A General and Scalable Framework for Solving Video Inverse Problems with Spatiotemporal Diffusion Priors [27.45644471304381]
We study how to solve general inverse problems involving videos using diffusion model priors. We introduce a general and scalable framework for solving video inverse problems.
arXiv Detail & Related papers (2025-04-10T08:24:26Z) - Weakly Supervised Video Anomaly Detection and Localization with Spatio-Temporal Prompts [57.01985221057047]
This paper introduces a novel method that learns spatio-temporal prompt embeddings for weakly supervised video anomaly detection and localization (WSVADL) based on pre-trained vision-language models (VLMs).
Our method achieves state-of-the-art performance on three public benchmarks for the WSVADL task.
arXiv Detail & Related papers (2024-08-12T03:31:29Z) - Learning Natural Consistency Representation for Face Forgery Video Detection [23.53549629885891]
We propose to learn the Natural Consistency representation (NACO) of real face videos in a self-supervised manner.
Our method outperforms other state-of-the-art methods with impressive robustness.
arXiv Detail & Related papers (2024-07-15T09:00:02Z) - Learning Temporally Consistent Video Depth from Video Diffusion Priors [62.36887303063542]
This work addresses the challenge of streamed video depth estimation. We argue that sharing contextual information between frames or clips is pivotal in fostering temporal consistency. Specifically, we propose a consistent context-aware training and inference strategy for arbitrarily long videos to provide cross-clip context.
arXiv Detail & Related papers (2024-06-03T16:20:24Z) - Spatial Decomposition and Temporal Fusion based Inter Prediction for Learned Video Compression [59.632286735304156]
We propose a spatial decomposition and temporal fusion based inter prediction for learned video compression.
With the SDD-based motion model and long short-term temporal fusion, our proposed learned video compression scheme can obtain more accurate inter prediction contexts.
arXiv Detail & Related papers (2024-01-29T03:30:21Z) - Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection [22.536129731902783]
We propose a Latent Spatiotemporal Adaptation (LAST) approach to facilitate generalized face forgery video detection.
We first model the spatiotemporal patterns of face videos by incorporating a lightweight CNN to extract local spatial features of each frame.
Then we learn the long-term spatiotemporal representations of videos in latent space, which should contain more clues than pixel space.
arXiv Detail & Related papers (2023-09-09T13:40:44Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the fact that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from the internal representations of models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Exploiting Spatial-temporal Correlations for Video Anomaly Detection [7.336831373786849]
Video anomaly detection (VAD) remains a challenging task in the pattern recognition community due to the ambiguity and diversity of abnormal events.
We introduce a discriminator to perform adversarial learning with the ST-LSTM to enhance the learning capability.
Our method achieves competitive performance compared to state-of-the-art methods, with AUCs of 96.7%, 87.8%, and 73.1% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively.
arXiv Detail & Related papers (2022-11-02T02:13:24Z) - Spatio-temporal predictive tasks for abnormal event detection in videos [60.02503434201552]
We propose new constrained pretext tasks to learn object-level normality patterns.
Our approach consists of learning a mapping between down-scaled visual queries and their corresponding normal appearance and motion characteristics.
Experiments on several benchmark datasets demonstrate the effectiveness of our approach to localize and track anomalies.
arXiv Detail & Related papers (2022-10-27T19:45:12Z) - Spatio-Temporal Transformer for Dynamic Facial Expression Recognition in the Wild [19.5702895176141]
We propose a method for capturing discriminative features within each frame and modeling contextual relationships among frames.
We utilize a CNN to translate each frame into a visual feature sequence.
Experiments indicate that our method provides an effective way to make use of the spatial and temporal dependencies.
arXiv Detail & Related papers (2022-05-10T08:47:15Z) - CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations [72.4716073597902]
We propose a method to learn object-centric Canonical Spatiotemporal Point Cloud Representations of dynamically moving or evolving objects.
We demonstrate the effectiveness of our method on several applications including shape reconstruction, camera pose estimation, continuous spatiotemporal sequence reconstruction, and correspondence estimation.
arXiv Detail & Related papers (2020-08-06T17:58:48Z)