DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training
- URL: http://arxiv.org/abs/2512.17323v1
- Date: Fri, 19 Dec 2025 08:12:20 GMT
- Title: DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training
- Authors: Jiyun Kong, Jun-Hyuk Kim, Jong-Seok Lee
- Abstract summary: Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes. We propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training.
- Score: 25.438410354399053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
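The residual idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: DESSERT works on residual latents with an ER-VAE and a Stable Diffusion backbone, while the code below only shows the pixel-space analogue (target = anchor + residual) and one plausible reading of Diverse-Length Temporal augmentation as sampling anchor/target pairs with varying temporal gaps. The function names `frame_residual`, `apply_residual`, and `sample_segment`, and the `max_gap` parameter, are hypothetical and do not come from the paper.

```python
import random


def frame_residual(anchor, target):
    """Per-pixel difference (target - anchor) of two equally sized grayscale frames."""
    return [
        [t - a for a, t in zip(row_a, row_t)]
        for row_a, row_t in zip(anchor, target)
    ]


def apply_residual(anchor, residual, lo=0, hi=255):
    """Add a (predicted) residual back onto the anchor and clip to the valid range."""
    return [
        [min(hi, max(lo, a + r)) for a, r in zip(row_a, row_r)]
        for row_a, row_r in zip(anchor, residual)
    ]


def sample_segment(num_frames, max_gap, rng):
    """Pick (anchor_idx, target_idx) with a random temporal gap in [1, max_gap].

    Assumption: varying the gap is one way to train on segments of
    different temporal lengths; the paper may define this differently.
    """
    if num_frames < 2 or max_gap < 1:
        raise ValueError("need at least two frames and max_gap >= 1")
    gap = rng.randint(1, min(max_gap, num_frames - 1))
    anchor_idx = rng.randint(0, num_frames - 1 - gap)
    return anchor_idx, anchor_idx + gap


if __name__ == "__main__":
    anchor = [[10, 20], [30, 40]]
    target = [[12, 18], [30, 255]]
    res = frame_residual(anchor, target)
    print(res)
    print(apply_residual(anchor, res) == target)
    print(sample_segment(num_frames=10, max_gap=4, rng=random.Random(0)))
```

In this toy setting a model would predict the residual from event data instead of reading it off the ground-truth target; predicting only the change relative to the anchor is what ties the synthesized frame to the previous one.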
Related papers
- UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models [67.24086328473437]
Event cameras excel at recording relative intensity changes rather than absolute intensity. The resulting data streams suffer from a significant loss of spatial information and static texture details. We address this limitation by leveraging a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data.
arXiv Detail & Related papers (2026-02-22T14:06:49Z)
- EvDiff: High Quality Video with an Event Camera [77.07279880903009]
Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. We propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos.
arXiv Detail & Related papers (2025-11-21T18:49:18Z)
- EVDI++: Event-based Video Deblurring and Interpolation via Self-Supervised Learning [36.86635176661841]
We introduce EVDI++, a self-supervised framework for Event-based Video Deblurring and Interpolation. We use the high temporal resolution of event cameras to mitigate motion blur and enable intermediate frame prediction. A self-supervised learning framework is proposed to enable network training with real-world blurry videos and events.
arXiv Detail & Related papers (2025-09-10T03:36:24Z)
- Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models [63.99949971803903]
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frame contexts with frame-wise importance so that more frames can be encoded within a fixed context length. We show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules.
arXiv Detail & Related papers (2025-04-17T04:02:31Z)
- CMTA: Cross-Modal Temporal Alignment for Event-guided Video Deblurring [44.30048301161034]
Video deblurring aims to enhance the quality of restored results in motion-blurred videos by gathering information from adjacent video frames.
We propose two modules: 1) Intra-frame feature enhancement operates within the exposure time of a single blurred frame, and 2) Inter-frame temporal feature alignment gathers valuable long-range temporal information to target frames.
We demonstrate that our proposed methods outperform state-of-the-art frame-based and event-based motion deblurring methods through extensive experiments conducted on both synthetic and real-world deblurring datasets.
arXiv Detail & Related papers (2024-08-27T10:09:17Z)
- Revisiting Event-based Video Frame Interpolation [49.27404719898305]
Dynamic vision sensors or event cameras provide rich complementary information for video frame interpolation.
Estimating optical flow from events is arguably more difficult than from RGB information.
We propose a divide-and-conquer strategy in which event-based intermediate frame synthesis happens incrementally in multiple simplified stages.
arXiv Detail & Related papers (2023-07-24T06:51:07Z)
- A Unified Framework for Event-based Frame Interpolation with Ad-hoc Deblurring in the Wild [72.0226493284814]
We propose a unified framework for event-based frame interpolation that performs deblurring ad-hoc. Our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring, and the joint task of both.
arXiv Detail & Related papers (2023-01-12T18:19:00Z)
- Unifying Motion Deblurring and Frame Interpolation with Events [11.173687810873433]
Slow shutter speed and long exposure time of frame-based cameras often cause visual blur and loss of inter-frame information, degenerating the overall quality of captured videos.
We present a unified framework of event-based motion deblurring and frame interpolation for blurry video enhancement, where the extremely low latency of events is leveraged to alleviate motion blur and facilitate intermediate frame prediction.
By exploring the mutual constraints among blurry frames, latent images, and event streams, we further propose a self-supervised learning framework to enable network training with real-world blurry videos and events.
arXiv Detail & Related papers (2022-03-23T03:43:12Z)
- TimeLens: Event-based Video Frame Interpolation [54.28139783383213]
We introduce Time Lens, a novel method that leverages the advantages of both synthesis-based and flow-based approaches.
We show an up to 5.21 dB improvement in terms of PSNR over state-of-the-art frame-based and event-based methods.
arXiv Detail & Related papers (2021-06-14T10:33:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.