FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
- URL: http://arxiv.org/abs/2505.14709v1
- Date: Sat, 17 May 2025 05:00:39 GMT
- Title: FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge
- Authors: Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu,
- Abstract summary: Auto-regressive (AR) models have recently shown promise in visual generation tasks due to their superior sampling efficiency.<n>Video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase.<n>We propose the textbfFastCar framework to accelerate the decode phase for the AR video generation by exploring the temporal redundancy.
- Score: 60.000984252907195
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the \textbf{FastCar} framework to accelerate the decode phase for the AR video generation by exploring the temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (\textit{i.e.}, reusing cached MLP outputs from the previous frame to reduce redundant computations) with detailed theoretical analysis and justification. Also, we develop a hardware accelerator on FPGA with Dynamic Resource Scheduling (DRS) based on TAS to enable better resource utilization and faster inference. Experimental results demonstrate the effectiveness of our method, which outperforms traditional sparse attention approaches with more than 2.1x decoding speedup and higher energy efficiency on the edge. Furthermore, by combining FastCar and sparse attention, FastCar can boost the performance of sparse attention with alleviated drifting, demonstrating our unique advantages for high-resolution and long-duration video generation. Code: https://github.com/shawnricecake/fast-car
Related papers
- FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information.<n>We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data.<n>Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z) - Astraea: A GPU-Oriented Token-wise Acceleration Framework for Video Diffusion Transformers [22.349130691342687]
Video diffusion transformers (vDiTs) have made impressive progress in text-to-video generation, but their high computational demands present major challenges for practical deployment.<n>We introduce ASTRAEA, an automatic framework that searches for near-optimal configurations for vDiT-based video generation.
arXiv Detail & Related papers (2025-06-05T14:41:38Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.<n>We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - Cascaded Temporal Updating Network for Efficient Video Super-Resolution [47.63267159007611]
Key components in recurrent-based VSR networks significantly impact model efficiency.
We propose a cascaded temporal updating network (CTUN) for efficient VSR.
CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods.
arXiv Detail & Related papers (2024-08-26T12:59:32Z) - Boosting Neural Representations for Videos with a Conditional Decoder [28.073607937396552]
Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing.
This paper introduces a universal boosting framework for current implicit video representation approaches.
arXiv Detail & Related papers (2024-02-28T08:32:19Z) - VNVC: A Versatile Neural Video Coding Framework for Efficient
Human-Machine Vision [59.632286735304156]
It is more efficient to enhance/analyze the coded representations directly without decoding them into pixels.
We propose a versatile neural video coding (VNVC) framework, which targets learning compact representations to support both reconstruction and direct enhancement/analysis.
arXiv Detail & Related papers (2023-06-19T03:04:57Z) - ReBotNet: Fast Real-time Video Enhancement [59.08038313427057]
Most restoration networks are slow, have high computational bottleneck, and can't be used for real-time video enhancement.
In this work, we design an efficient and fast framework to perform real-time enhancement for practical use-cases like live video calls and video streams.
To evaluate our method, we emulate two new datasets that real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.
arXiv Detail & Related papers (2023-03-23T17:58:05Z) - You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous respectable works have made decent success, but they only focus on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.