ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
- URL: http://arxiv.org/abs/2601.04342v1
- Date: Wed, 07 Jan 2026 19:26:30 GMT
- Title: ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers
- Authors: Mohsen Ghafoorian, Amirhossein Habibian
- Abstract summary: ReHyAt is a hybrid attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention. Our experiments demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear.
- Score: 10.830662834634879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling a chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt's hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. The project page is available at https://qualcomm-ai-research.github.io/rehyat.
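The abstract does not give ReHyAt's exact equations, so the following is a minimal sketch of one plausible hybrid design consistent with its description: exact softmax attention inside each chunk plus linear attention over all earlier chunks through a small recurrent state, which is what keeps memory constant in sequence length. The feature map, the additive mix of the two terms, and the causal within-chunk masking are illustrative assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # Positive feature map often used for linear attention (an assumption here).
    return F.elu(x) + 1.0

def hybrid_attention(q, k, v, chunk_size=256):
    """q, k, v: (seq_len, dim) tensors; returns (seq_len, dim).

    Memory is constant in seq_len: only the current chunk, a (dim, dim)
    recurrent state S, and a (dim,) normalizer z are ever materialized.
    """
    seq_len, dim = q.shape
    S = torch.zeros(dim, dim, dtype=q.dtype)  # running sum of phi(k)^T v over past chunks
    z = torch.zeros(dim, dtype=q.dtype)       # running sum of phi(k) for normalization
    outputs = []
    for start in range(0, seq_len, chunk_size):
        qc = q[start:start + chunk_size]
        kc = k[start:start + chunk_size]
        vc = v[start:start + chunk_size]
        # High-fidelity softmax attention within the chunk (quadratic only in chunk_size).
        local = F.scaled_dot_product_attention(
            qc.unsqueeze(0), kc.unsqueeze(0), vc.unsqueeze(0), is_causal=True
        ).squeeze(0)
        # Linear attention over all earlier chunks via the recurrent state.
        phi_q = feature_map(qc)
        denom = (phi_q @ z).clamp_min(1e-6).unsqueeze(-1)
        global_ctx = (phi_q @ S) / denom
        outputs.append(local + global_ctx)  # additive mix of the two terms (an assumption)
        # Fold this chunk's keys/values into the recurrent state.
        phi_k = feature_map(kc)
        S = S + phi_k.transpose(0, 1) @ vc
        z = z + phi_k.sum(dim=0)
    return torch.cat(outputs, dim=0)
```

Under this split, total cost is O(seq_len · chunk_size · dim) for the softmax term plus O(seq_len · dim²) for the linear term, i.e. linear rather than quadratic in sequence length, which matches the scalability claim in the abstract.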
Related papers
- Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts [27.8245634187787]
We present HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data. (A minimal sketch of this softmax-to-linear distillation recipe appears after this list.)
arXiv Detail & Related papers (2026-01-29T18:59:53Z) - Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation [69.57572900337176]
We introduce Reward Forcing, a novel framework for efficient streaming video generation. EMA-Sink tokens capture both long-term context and recent dynamics, preventing initial frame copying. Re-DMD biases the model's output distribution toward high-reward regions by prioritizing samples with greater dynamics rated by a vision-language model.
arXiv Detail & Related papers (2025-12-04T11:12:13Z) - Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer [13.545000689565732]
Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch.
arXiv Detail & Related papers (2025-09-29T15:09:51Z) - VideoMAR: Autoregressive Video Generation with Continuous Tokens [33.906543515428424]
Masked-based autoregressive models have demonstrated promising image generation capability in continuous space. We propose VideoMAR, a decoder-only autoregressive image-to-video model with continuous tokens. VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters.
arXiv Detail & Related papers (2025-06-17T04:08:18Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z) - Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis [50.77548592888096]
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals. Turbo2K is an efficient framework for generating detail-rich 2K videos.
arXiv Detail & Related papers (2025-04-20T03:30:59Z) - From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
arXiv Detail & Related papers (2024-12-10T18:59:50Z) - LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
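Several entries above (HALO, Attention Surgery, LinFusion) and ReHyAt itself share a recipe: distill a frozen softmax-attention teacher into a linear or hybrid student rather than training from scratch. Below is a minimal sketch of the common core, assuming a simple per-layer output-matching objective; the MSE loss and the function names are illustrative assumptions, not taken from any one paper.

```python
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, hidden_states):
    """Layer-wise output matching between a frozen softmax-attention teacher
    and a trainable linear/hybrid-attention student (illustrative objective).

    hidden_states: (batch, seq_len, dim) activations fed to both modules.
    """
    with torch.no_grad():
        target = teacher_attn(hidden_states)  # teacher runs frozen, no gradients
    pred = student_attn(hidden_states)        # student is the only trainable part
    return F.mse_loss(pred, target)
```

Because only the attention modules are retrained against matched teacher targets, the reported budgets (2.3B tokens for HALO, ~160 GPU hours for ReHyAt) stay far below full pre-training cost.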