LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
- URL: http://arxiv.org/abs/2510.15392v1
- Date: Fri, 17 Oct 2025 07:45:43 GMT
- Title: LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding
- Authors: Peng Ren, Hai Yang
- Abstract summary: LILAC builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design. This architecture enables real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture.
- Score: 5.946860384629338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating long and stylized human motions in real time is critical for applications that demand continuous and responsive character control. Despite its importance, existing streaming approaches often operate directly in the raw motion space, leading to substantial computational overhead and making it difficult to maintain temporal stability. In contrast, latent-space VAE-Diffusion-based frameworks alleviate these issues and achieve high-quality stylization, but they are generally confined to offline processing. To bridge this gap, LILAC (Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding) builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design and the injection of decoded motion features to ensure smooth motion transitions. This architecture enables long-sequence real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture, achieving a favorable balance between stylization quality and responsiveness as demonstrated by experiments on benchmark datasets. Supplementary video and examples are available at the project page: https://pren1.github.io/lilac/
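To make the streaming design concrete, below is a minimal sketch of how a sliding-window, latent-space streaming loop with re-injection of already-decoded motion might be organized. All names (MotionVAE, LatentDenoiser, stylize_stream), dimensions, and the toy few-step denoising schedule are illustrative assumptions, not LILAC's actual implementation or API.

```python
# Hypothetical sketch of a sliding-window, latent-space streaming stylization loop.
# Module names and shapes are illustrative stand-ins, not the LILAC code.
import torch
import torch.nn as nn

LATENT_DIM, MOTION_DIM, WINDOW, HOP = 64, 263, 16, 4

class MotionVAE(nn.Module):
    """Toy stand-in for a motion VAE: per-frame encode/decode in latent space."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(MOTION_DIM, LATENT_DIM)
        self.dec = nn.Linear(LATENT_DIM, MOTION_DIM)
    def encode(self, motion):            # (T, MOTION_DIM) -> (T, LATENT_DIM)
        return self.enc(motion)
    def decode(self, latent):            # (T, LATENT_DIM) -> (T, MOTION_DIM)
        return self.dec(latent)

class LatentDenoiser(nn.Module):
    """Toy stand-in for the (unmodified) latent diffusion denoiser."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * LATENT_DIM, LATENT_DIM)
    def forward(self, noisy_latent, style_latent):
        # Condition every window frame on a pooled style latent.
        style = style_latent.mean(dim=0, keepdim=True).expand_as(noisy_latent)
        return self.net(torch.cat([noisy_latent, style], dim=-1))

def stylize_stream(vae, denoiser, content_stream, style_motion, steps=4):
    """Causally stylize an incoming motion stream, HOP frames at a time.

    Only past frames are used: each new chunk is denoised inside a sliding
    window whose older part is re-encoded from already-decoded output motion,
    which is what keeps transitions between chunks smooth.
    """
    style_latent = vae.encode(style_motion)
    decoded_history = []                            # decoded motion frames so far
    for start in range(0, content_stream.shape[0] - HOP + 1, HOP):
        chunk = content_stream[start:start + HOP]   # newly arrived frames
        # Build the causal window: past decoded motion (re-encoded) + new chunk.
        if decoded_history:
            past = torch.cat(decoded_history, dim=0)[-(WINDOW - HOP):]
            window_latent = torch.cat([vae.encode(past), vae.encode(chunk)], dim=0)
        else:
            window_latent = vae.encode(chunk)
        # Few-step denoising of the window latent (schedule is schematic).
        latent = window_latent + 0.1 * torch.randn_like(window_latent)
        for _ in range(steps):
            latent = denoiser(latent, style_latent)
        # Emit only the newest HOP frames; older frames are already committed.
        new_frames = vae.decode(latent[-HOP:])
        decoded_history.append(new_frames)
        yield new_frames

if __name__ == "__main__":
    torch.manual_seed(0)
    vae, denoiser = MotionVAE(), LatentDenoiser()
    content = torch.randn(64, MOTION_DIM)   # incoming neutral motion
    style = torch.randn(32, MOTION_DIM)     # arbitrary style reference clip
    out = torch.cat(list(stylize_stream(vae, denoiser, content, style)), dim=0)
    print(out.shape)  # torch.Size([64, 263])
```

The point of the sketch is that each emitted chunk conditions on a window whose older frames come from the already-decoded output stream, so no future frames are ever required and chunk boundaries stay smooth; how LILAC actually injects the decoded motion features is specified in the paper.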
Related papers
- Causal Motion Diffusion Models for Autoregressive Motion Generation [19.61051102039212]
Causal Motion Diffusion Models (CMDM) is a unified framework for autoregressive motion generation. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. Experiments on HumanML3D and SnapMoGen demonstrate CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness.
arXiv Detail & Related papers (2026-02-26T03:58:25Z)
- Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation [16.692450893925148]
We present a novel streaming framework named Knot Forcing for real-time portrait animation. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences.
arXiv Detail & Related papers (2025-12-25T16:34:56Z)
- Characterizing Motion Encoding in Video Diffusion Timesteps [50.13907856401258]
We study how motion is encoded in video diffusion timesteps through the trade-off between appearance editing and motion preservation. We identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space.
arXiv Detail & Related papers (2025-12-18T21:20:54Z)
- Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length [57.458450695137664]
We present Live Avatar, an algorithm-system co-designed framework for efficient, high-fidelity, and infinite-length avatar generation. Live Avatar is the first to achieve practical, real-time, high-fidelity avatar generation at this scale.
arXiv Detail & Related papers (2025-12-04T11:11:24Z)
- Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events [71.2439653098351]
Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary temporal scales. We present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams. Our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining generalizability at OOD scales.
arXiv Detail & Related papers (2025-10-04T15:23:07Z)
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time [86.40480237741609]
Rolling Forcing is a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme. Second, we introduce the attention sink mechanism into the long-horizon streaming video generation task, which allows the model to keep key-value states of initial frames as a global context anchor (a toy sketch of this idea appears after the list below). Third, we design an efficient training algorithm that enables few-step distillation over largely extended denoising windows.
arXiv Detail & Related papers (2025-09-29T17:57:14Z)
- Feature-aligned Motion Transformation for Efficient Dynamic Point Cloud Compression [96.50160784402338]
We introduce a Feature-aligned Motion Transformation (FMT) framework for dynamic point cloud compression. FMT replaces explicit motion vectors with an alignment strategy that implicitly models continuous temporal variations. Our method surpasses D-DPCC and AdaDPCC in both encoding and decoding efficiency, while also achieving BD-Rate reductions of 20% and 9.4%.
arXiv Detail & Related papers (2025-09-18T03:51:06Z)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
- MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space [40.60429652169086]
Text-conditioned streaming motion generation requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths. We propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model.
arXiv Detail & Related papers (2025-03-19T17:32:24Z)
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
- Progressive Temporal Feature Alignment Network for Video Inpainting [51.26380898255555]
Video inpainting aims to fill spatio-temporal "corrupted regions" with plausible content.
Current methods achieve this goal through attention, flow-based warping, or 3D temporal convolution.
We propose the 'Progressive Temporal Feature Alignment Network', which progressively enriches features extracted from the current frame with warped features from neighbouring frames.
arXiv Detail & Related papers (2021-04-08T04:50:33Z)
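As referenced in the Rolling Forcing summary above, here is a toy, hedged sketch of the general attention-sink idea: the key-value states of the first few frames are kept permanently as a global anchor while the rest of the cache rolls, keeping memory bounded over an unbounded stream. Class names, sizes, and the single-head attention are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of an attention-sink KV cache for streaming generation.
import torch
import torch.nn.functional as F

D_MODEL, SINK_FRAMES, RECENT_FRAMES = 64, 2, 8

class SinkKVCache:
    """Keeps keys/values of the first SINK_FRAMES frames forever,
    plus a rolling window of the RECENT_FRAMES most recent frames."""
    def __init__(self):
        self.sink_k, self.sink_v = [], []
        self.recent_k, self.recent_v = [], []

    def append(self, k, v):                      # k, v: (1, D_MODEL) per frame
        if len(self.sink_k) < SINK_FRAMES:
            self.sink_k.append(k); self.sink_v.append(v)
        else:
            self.recent_k.append(k); self.recent_v.append(v)
            self.recent_k = self.recent_k[-RECENT_FRAMES:]
            self.recent_v = self.recent_v[-RECENT_FRAMES:]

    def keys_values(self):
        k = torch.cat(self.sink_k + self.recent_k, dim=0)   # (S, D_MODEL)
        v = torch.cat(self.sink_v + self.recent_v, dim=0)
        return k, v

def attend(query, cache):
    """Single-head attention of one new frame's query over the cached context."""
    k, v = cache.keys_values()
    scores = query @ k.T / D_MODEL ** 0.5        # (1, S)
    return F.softmax(scores, dim=-1) @ v         # (1, D_MODEL)

if __name__ == "__main__":
    torch.manual_seed(0)
    cache = SinkKVCache()
    for t in range(100):                         # arbitrarily long stream
        frame_feat = torch.randn(1, D_MODEL)
        cache.append(frame_feat, frame_feat)     # toy: reuse feature as K and V
        out = attend(frame_feat, cache)
    k, _ = cache.keys_values()
    print(k.shape)  # torch.Size([10, 64]): 2 sink frames + 8 recent frames
```

The design choice being illustrated is that attention always sees the initial "sink" frames plus a bounded recent window, so compute and memory stay constant per step even as the generated sequence grows without bound.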