Related papers: RAIN: Real-time Animation of Infinite Video Stream

URL: http://arxiv.org/abs/2412.19489v1
Date: Fri, 27 Dec 2024 07:13:15 GMT
Title: RAIN: Real-time Animation of Infinite Video Stream
Authors: Zhilei Shu, Ruili Feng, Yang Cao, Zheng-Jun Zha
Abstract summary: RAIN is a pipeline solution capable of animating infinite video streams in real-time with low latency. RAIN generates video frames with much shorter latency and faster speed, while maintaining long-range attention over extended video streams. RAIN can animate characters in real-time with much better quality, accuracy, and consistency than competitors.
Score: 52.97171098038888
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Live animation has gained immense popularity for enhancing online engagement, yet achieving high-quality, real-time, and stable animation with diffusion models remains challenging, especially on consumer-grade GPUs. Existing methods struggle to generate long, consistent video streams efficiently, and are often limited by latency issues and degraded visual quality over extended periods. In this paper, we introduce RAIN, a pipeline solution capable of animating infinite video streams in real time with low latency using a single RTX 4090 GPU. The core idea of RAIN is to efficiently compute frame-token attention across different noise levels and long time intervals while simultaneously denoising a significantly larger number of frame-tokens than previous stream-based methods. This design allows RAIN to generate video frames with much shorter latency and faster speed, while maintaining long-range attention over extended video streams, resulting in enhanced continuity and consistency. Consequently, a Stable Diffusion model fine-tuned with RAIN in just a few epochs can produce video streams in real time and with low latency without much compromise in quality or consistency, even for infinitely long streams. Despite its advanced capabilities, RAIN introduces only a few additional 1D attention blocks, imposing minimal additional burden. Experiments on benchmark datasets and on generating super-long videos demonstrate that RAIN can animate characters in real time with much better quality, accuracy, and consistency than competitors, and with lower latency. All code and models will be made publicly available.

Related papers

DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion [4.863177884263436]
We present a training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity.
arXiv Detail & Related papers (2025-06-02T09:12:41Z)
Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining [14.025870185802463]
We present an improved SSMs-based video deraining network (RainMamba) with a novel Hilbert mechanism to better capture sequence-level local information. We also introduce a difference-guided dynamic contrastive locality learning strategy to enhance the patch-level self-similarity learning ability of the proposed network.
arXiv Detail & Related papers (2024-07-31T17:48:22Z)
UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation. We map the reference image along with the posture guidance and noise video into a common feature space. We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z)
Partial Rewriting for Multi-Stage ASR [14.642804773149713]
We improve the quality of streaming results by around 10%, without altering the final results. Our approach introduces no additional latency and reduces flickering. It is also lightweight, does not require retraining the model, and it can be applied to a wide variety of multi-stage architectures.
arXiv Detail & Related papers (2023-12-08T00:31:43Z)
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference. This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Lookup Table [21.77469059123589]
We propose an efficient pipeline named FastLLVE to maintain inter-frame brightness consistency effectively. FastLLVE can process 1,080p videos at $\mathit{50+}$ Frames Per Second (FPS), which is $\mathit{2\times}$ faster than CNN-based methods in inference time.
arXiv Detail & Related papers (2023-08-13T11:54:14Z)
You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query. Previous works have achieved decent success, but they only focus on high-level visual features extracted from decoded frames. We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time. This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs). We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.