RAIN: Real-time Animation of Infinite Video Stream
- URL: http://arxiv.org/abs/2412.19489v1
- Date: Fri, 27 Dec 2024 07:13:15 GMT
- Title: RAIN: Real-time Animation of Infinite Video Stream
- Authors: Zhilei Shu, Ruili Feng, Yang Cao, Zheng-Jun Zha
- Abstract summary: RAIN is a pipeline solution capable of animating infinite video streams in real-time with low latency.
RAIN generates video frames with significantly lower latency and higher speed, while maintaining long-range attention over extended video streams.
RAIN can animate characters in real-time with much better quality, accuracy, and consistency than competitors.
- Score: 52.97171098038888
- License:
- Abstract: Live animation has gained immense popularity for enhancing online engagement, yet achieving high-quality, real-time, and stable animation with diffusion models remains challenging, especially on consumer-grade GPUs. Existing methods struggle to generate long, consistent video streams efficiently, often limited by latency issues and degraded visual quality over extended periods. In this paper, we introduce RAIN, a pipeline solution capable of animating infinite video streams in real-time with low latency using a single RTX 4090 GPU. The core idea of RAIN is to efficiently compute frame-token attention across different noise levels and long time intervals while simultaneously denoising a significantly larger number of frame tokens than previous stream-based methods. This design allows RAIN to generate video frames with much lower latency and higher speed, while maintaining long-range attention over extended video streams, resulting in enhanced continuity and consistency. Consequently, a Stable Diffusion model fine-tuned with RAIN in just a few epochs can produce video streams in real time with low latency and little compromise in quality or consistency, for arbitrarily long durations. Despite these capabilities, RAIN introduces only a few additional 1D attention blocks, imposing minimal extra burden. Experiments on benchmark datasets and on super-long video generation demonstrate that RAIN can animate characters in real-time with much better quality, accuracy, and consistency than competitors, at lower latency. All code and models will be made publicly available.
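The abstract only sketches the mechanism, so the following is a minimal, hypothetical PyTorch illustration of the two ingredients it names: a lightweight 1D attention block applied along the frame (temporal) axis of frame tokens, and a streaming loop that keeps a window of frames at staggered noise levels and denoises them jointly. All names (`TemporalAttention1D`, `denoise_stream`, the dummy denoiser) are placeholders, not the released RAIN code.

```python
# Hypothetical sketch of RAIN-style streaming denoising (not the released code).
import torch
import torch.nn as nn

class TemporalAttention1D(nn.Module):
    """Lightweight 1D self-attention over the frame axis of frame tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (spatial_tokens, frames, dim) -- attention runs only along the frame axis.
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

def denoise_stream(denoise_step, attn, window=16, spatial=64, dim=320, n_frames=8):
    """Toy stream loop: a buffer holds `window` frames at staggered noise levels;
    every iteration all of them take one denoising step jointly (with long-range
    temporal attention), the most-denoised frame is emitted, and a fresh
    pure-noise frame enters at the front of the buffer."""
    tokens = torch.randn(spatial, window, dim)          # frame tokens
    steps_done = torch.arange(window)                   # frame i has had i steps so far
    for _ in range(n_frames):
        tokens = attn(tokens)                           # 1D attention across all frames
        tokens = denoise_step(tokens, steps_done)       # one denoising step per frame
        steps_done = steps_done + 1
        yield tokens[:, -1]                             # oldest frame: `window` steps done
        tokens = torch.cat([torch.randn(spatial, 1, dim), tokens[:, :-1]], dim=1)
        steps_done = torch.cat([torch.zeros(1, dtype=torch.long), steps_done[:-1]])

# Usage with a dummy denoiser standing in for one step of a fine-tuned Stable Diffusion UNet.
attn = TemporalAttention1D(dim=320)
dummy_step = lambda x, t: 0.98 * x
for frame_tokens in denoise_stream(dummy_step, attn, n_frames=3):
    print(frame_tokens.shape)                           # torch.Size([64, 320])
```

Because every frame in the window receives one step per iteration, a new frame is emitted at each iteration rather than after a full denoising schedule, which is what keeps per-frame latency low.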
Related papers
- Real-Time Neural-Enhancement for Online Cloud Gaming [31.971805571638942]
We introduce River, a cloud gaming delivery framework based on the observation that video segment features in cloud gaming are typically repetitive and redundant.
River builds a content-aware encoder that fine-tunes SR models for diverse video segments and stores them in a lookup table.
When delivering cloud gaming video streams online, River checks the video features and retrieves the most relevant SR models to enhance the frame quality.
arXiv Detail & Related papers (2025-01-12T17:28:09Z)
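River's lookup-table idea (fine-tune an SR model per segment type, then retrieve the closest one by segment features) can be illustrated with a small nearest-neighbour lookup. The feature descriptor and the stored "models" below are stand-ins, not River's actual components.

```python
# Illustrative feature-keyed model lookup (stand-in for River's content-aware LUT).
import torch
import torch.nn.functional as F

class SRModelLookup:
    def __init__(self):
        self.keys = []    # segment feature vectors
        self.models = []  # fine-tuned SR models (any callable)

    def add(self, segment_feature: torch.Tensor, sr_model):
        self.keys.append(F.normalize(segment_feature, dim=0))
        self.models.append(sr_model)

    def retrieve(self, segment_feature: torch.Tensor):
        """Return the SR model whose key is most similar (cosine) to the query."""
        keys = torch.stack(self.keys)                       # (N, D)
        query = F.normalize(segment_feature, dim=0)         # (D,)
        best = torch.argmax(keys @ query).item()
        return self.models[best]

def segment_feature(frames: torch.Tensor, bins: int = 16) -> torch.Tensor:
    """Toy segment descriptor: a luminance histogram over the frames."""
    luma = frames.mean(dim=1)                               # (T, H, W), average over RGB
    return torch.histc(luma, bins=bins, min=0.0, max=1.0) / luma.numel()

# Usage: register two dummy "SR models" and pick one for an incoming segment.
lut = SRModelLookup()
lut.add(torch.rand(16), lambda x: F.interpolate(x, scale_factor=2, mode="bilinear"))
lut.add(torch.rand(16), lambda x: F.interpolate(x, scale_factor=2, mode="nearest"))
segment = torch.rand(8, 3, 64, 64)                          # 8 frames of one segment
model = lut.retrieve(segment_feature(segment))
print(model(segment).shape)                                 # torch.Size([8, 3, 128, 128])
```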
- RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining [14.025870185802463]
We present an improved SSM-based video deraining network (RainMamba) with a novel Hilbert scanning mechanism to better capture sequence-level local information.
We also introduce a difference-guided dynamic contrastive locality learning strategy to enhance the patch-level self-similarity learning ability of the proposed network.
arXiv Detail & Related papers (2024-07-31T17:48:22Z)
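The key ingredient named above is reordering video patch tokens along a Hilbert curve so that spatially neighbouring patches stay close in the 1D sequence fed to the state space model. Below is a generic Hilbert-order computation and reordering step (the standard xy-to-Hilbert-index conversion), not RainMamba's actual scanning code.

```python
# Generic Hilbert-curve scan order for patch tokens (illustration, not RainMamba code).
import torch

def hilbert_index(n: int, x: int, y: int) -> int:
    """Distance along the Hilbert curve covering an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                          # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_scan(tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """Reorder (B, grid*grid, D) patch tokens from row-major to Hilbert order,
    so the 1D sequence given to an SSM keeps spatial neighbours adjacent."""
    order = sorted(range(grid * grid),
                   key=lambda i: hilbert_index(grid, i % grid, i // grid))
    return tokens[:, torch.tensor(order), :]

# Usage: an 8x8 patch grid, two frames flattened into the batch dimension.
tokens = torch.randn(2, 64, 192)
print(hilbert_scan(tokens, grid=8).shape)    # torch.Size([2, 64, 192])
```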
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
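Live2Diff's stated design choice is uni-directional temporal attention: each frame attends only to itself and earlier frames, which is what makes streaming use possible. A minimal masked-attention sketch (not the Live2Diff implementation):

```python
# Minimal uni-directional (causal) temporal attention sketch (not Live2Diff's code).
import torch
import torch.nn as nn

class CausalTemporalAttention(nn.Module):
    """Each frame token attends only to itself and to earlier frames."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim)
        t = x.size(1)
        # True entries are masked out: frame i may not see frames j > i.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        return out

# A future frame cannot change the output for earlier frames:
layer = CausalTemporalAttention(dim=64).eval()
x = torch.randn(1, 8, 64)
y_full = layer(x)[:, :4]
y_prefix = layer(x[:, :4])
print(torch.allclose(y_full, y_prefix, atol=1e-5))   # True (up to numerical noise)
```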
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z)
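The UniAnimate summary names two ideas: projecting the reference image, pose guidance, and noised video into one shared feature space, and a unified noise input that is either pure random noise or conditioned on a first frame. The schematic below uses hypothetical module names and shapes, not the UniAnimate code.

```python
# Schematic of UniAnimate-style unified conditioning (hypothetical, not official code).
import torch
import torch.nn as nn

class UnifiedInputEncoder(nn.Module):
    """Maps reference image, pose guidance, and noised video into one feature space."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.ref_proj = nn.Conv2d(3, dim, kernel_size=3, padding=1)
        self.pose_proj = nn.Conv2d(3, dim, kernel_size=3, padding=1)   # e.g. rendered pose maps
        self.vid_proj = nn.Conv2d(4, dim, kernel_size=3, padding=1)    # noised latent frames

    def forward(self, ref_img, pose_seq, noised_video):
        # ref_img: (B,3,H,W)   pose_seq: (B,T,3,H,W)   noised_video: (B,T,4,H,W)
        b, t = noised_video.shape[:2]
        ref = self.ref_proj(ref_img).unsqueeze(1).expand(-1, t, -1, -1, -1)
        pose = self.pose_proj(pose_seq.flatten(0, 1)).unflatten(0, (b, t))
        vid = self.vid_proj(noised_video.flatten(0, 1)).unflatten(0, (b, t))
        return ref + pose + vid                         # summed in the shared feature space

def unified_noise(shape, first_frame_latent=None, strength: float = 0.5):
    """Unified noise input: pure random noise, or noise biased toward a first-frame latent
    (the blending rule here is an assumption for illustration)."""
    noise = torch.randn(shape)
    if first_frame_latent is not None:                  # first-frame conditioned variant
        noise = strength * first_frame_latent.unsqueeze(1) + (1 - strength) * noise
    return noise

# Usage
enc = UnifiedInputEncoder()
b, t, h, w = 1, 4, 32, 32
feats = enc(torch.randn(b, 3, h, w), torch.randn(b, t, 3, h, w),
            unified_noise((b, t, 4, h, w)))
print(feats.shape)                                      # torch.Size([1, 4, 128, 32, 32])
```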
- Partial Rewriting for Multi-Stage ASR [14.642804773149713]
We improve the quality of streaming results by around 10%, without altering the final results.
Our approach introduces no additional latency and reduces flickering.
It is also lightweight, does not require retraining the model, and it can be applied to a wide variety of multi-stage architectures.
arXiv Detail & Related papers (2023-12-08T00:31:43Z)
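The summary does not spell out the rewriting rule, so the sketch below shows one plausible reading: when a higher-quality second-stage hypothesis for an already-streamed span arrives, only the suffix that actually differs is replaced, which limits flicker and never touches the final result. The function is hypothetical, not the paper's algorithm.

```python
# Hypothetical partial-rewriting sketch for a two-stage streaming ASR display.
def partial_rewrite(displayed: list, second_pass: list) -> list:
    """Replace only the differing suffix of the displayed hypothesis with the
    second-pass words, keeping the longest common prefix untouched (less flicker)."""
    common = 0
    while (common < len(displayed) and common < len(second_pass)
           and displayed[common] == second_pass[common]):
        common += 1
    return displayed[:common] + second_pass[common:]

# Usage: the second pass fixes only the tail of the first-pass stream.
first_pass = "i want to recognize speech today".split()
second_pass = "i want to wreck a nice beach today".split()
print(" ".join(partial_rewrite(first_pass, second_pass)))
# -> "i want to wreck a nice beach today"  (the prefix "i want to" stays as displayed)
```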
- FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
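FreeNoise is described as a tuning-free noise-rescheduling scheme for longer videos. The summary does not give the schedule itself; the sketch below shows one common reading of "rescheduling": reuse the training-length initial noise beyond its length while locally shuffling it, so long-range noise stays correlated without exact repetition. Treat it as an assumption-laden illustration, not FreeNoise's algorithm.

```python
# Illustrative noise rescheduling for longer videos (an interpretation, not FreeNoise's code).
import torch

def reschedule_noise(base_noise: torch.Tensor, target_frames: int,
                     window: int = 4, generator=None) -> torch.Tensor:
    """Extend (T, C, H, W) initial noise to `target_frames` by repeating it and
    locally shuffling frames inside small windows, so the extended sequence stays
    correlated with the training-length noise without repeating it exactly."""
    t = base_noise.size(0)
    reps = (target_frames + t - 1) // t                 # ceil division
    extended = base_noise.repeat(reps, 1, 1, 1)[:target_frames].clone()
    for start in range(t, target_frames, window):       # keep the first T frames as-is
        end = min(start + window, target_frames)
        perm = torch.randperm(end - start, generator=generator)
        extended[start:end] = extended[start:end][perm]
    return extended

# Usage: a model trained with 16 noise frames, sampled for 64 frames.
base = torch.randn(16, 4, 32, 32)
long_noise = reschedule_noise(base, target_frames=64)
print(long_noise.shape)                                 # torch.Size([64, 4, 32, 32])
```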
- FastLLVE: Real-Time Low-Light Video Enhancement with Intensity-Aware Lookup Table [21.77469059123589]
We propose an efficient pipeline named FastLLVE to maintain inter-frame brightness consistency effectively.
FastLLVE can process 1,080p videos at 50+ Frames Per Second (FPS), which is 2× faster than CNN-based methods in inference time.
arXiv Detail & Related papers (2023-08-13T11:54:14Z)
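FastLLVE's speed comes from replacing per-pixel network inference with a lookup table. As a simplified stand-in for its intensity-aware LUT, the sketch below applies a learnable per-channel 1D LUT with linear interpolation to every frame; the real method uses a richer intensity-aware LUT.

```python
# Simplified per-channel 1D LUT enhancement (stand-in for FastLLVE's intensity-aware LUT).
import torch
import torch.nn as nn

class Video1DLUT(nn.Module):
    """Learnable per-channel lookup table applied to every pixel of every frame."""
    def __init__(self, bins: int = 33, channels: int = 3):
        super().__init__()
        # Initialise as the identity mapping; training would bend it to brighten dark inputs.
        self.table = nn.Parameter(torch.linspace(0, 1, bins).repeat(channels, 1))

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (T, C, H, W) with values in [0, 1]
        bins = self.table.size(1)
        pos = video.clamp(0, 1) * (bins - 1)
        lo = pos.floor().long().clamp(max=bins - 2)
        frac = pos - lo
        out = torch.empty_like(video)
        for c in range(video.size(1)):                  # per-channel linear interpolation
            t = self.table[c]
            out[:, c] = t[lo[:, c]] * (1 - frac[:, c]) + t[lo[:, c] + 1] * frac[:, c]
        return out

# Usage: a dark clip passes through the identity-initialised LUT unchanged.
lut = Video1DLUT()
clip = torch.rand(8, 3, 64, 64) * 0.2
print(torch.allclose(lut(clip), clip, atol=1e-6))       # True for the identity table
```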
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Prior works have achieved reasonable success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
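The compressed-domain setting means the grounding model consumes what the bit-stream already contains (sparse fully decoded I-frames plus motion vectors and residuals for the other frames) instead of decoding every frame. Below is a schematic fusion-and-scoring head over such features; all shapes and modules are hypothetical, not the paper's model.

```python
# Schematic compressed-domain grounding head (hypothetical shapes/modules, not the paper's model).
import torch
import torch.nn as nn

class CompressedTSGHead(nn.Module):
    """Fuses I-frame, motion-vector and residual features, then scores each time step
    against a sentence embedding to localise the queried moment."""
    def __init__(self, d_iframe=512, d_mv=128, d_res=128, d_text=300, dim=256):
        super().__init__()
        self.visual = nn.Linear(d_iframe + d_mv + d_res, dim)
        self.text = nn.Linear(d_text, dim)

    def forward(self, iframe_feat, mv_feat, res_feat, query_feat):
        # iframe_feat: (B,T,512)   mv_feat/res_feat: (B,T,128)   query_feat: (B,300)
        v = self.visual(torch.cat([iframe_feat, mv_feat, res_feat], dim=-1))   # (B,T,dim)
        q = self.text(query_feat).unsqueeze(1)                                 # (B,1,dim)
        return torch.sigmoid((v * q).sum(-1))                                  # (B,T) relevance

# Usage: take the highest-scoring time step as a crude moment estimate.
head = CompressedTSGHead()
scores = head(torch.randn(1, 20, 512), torch.randn(1, 20, 128),
              torch.randn(1, 20, 128), torch.randn(1, 300))
print(scores.shape, scores.argmax(dim=1))    # torch.Size([1, 20]) and the best-scoring step
```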
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling temporal relations for composing videos of arbitrary length, from a few frames to effectively infinite, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
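For the GAN-based composition described above, one way to picture "arbitrary length with smooth transitions" is to drive a per-frame image generator with a latent trajectory that interpolates between anchor latents; the generator and the interpolation below are placeholders for illustration, not the paper's model.

```python
# Toy latent-trajectory sketch for arbitrarily long GAN video (placeholders, not the paper's model).
import torch
import torch.nn as nn

def latent_trajectory(anchors: torch.Tensor, frames_per_segment: int) -> torch.Tensor:
    """Linearly interpolate between consecutive anchor latents to get one latent per frame;
    a smooth trajectory (plus alias-free generator layers) gives smooth frame transitions."""
    segments = []
    for a, b in zip(anchors[:-1], anchors[1:]):
        w = torch.linspace(0, 1, frames_per_segment + 1)[:-1].unsqueeze(1)   # drop the endpoint
        segments.append((1 - w) * a + w * b)
    segments.append(anchors[-1:])                     # include the final anchor frame
    return torch.cat(segments, dim=0)                 # (num_frames, latent_dim)

# Placeholder per-frame generator standing in for a pretrained image GAN.
generator = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 3 * 16 * 16))

anchors = torch.randn(4, 64)                          # 4 anchor latents -> 3 segments
latents = latent_trajectory(anchors, frames_per_segment=8)
video = generator(latents).view(-1, 3, 16, 16)        # one frame per latent
print(video.shape)                                    # torch.Size([25, 3, 16, 16])
```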