MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion
- URL: http://arxiv.org/abs/2412.09828v1
- Date: Fri, 13 Dec 2024 03:39:09 GMT
- Title: MSC: Multi-Scale Spatio-Temporal Causal Attention for Autoregressive Video Diffusion
- Authors: Xunnong Xu, Mengying Cao
- Abstract summary: We propose a Multi-Scale Causal (MSC) framework to address these problems. We introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion transformers enable flexible generative modeling for video. However, it is still technically challenging and computationally expensive to generate high-resolution videos with rich semantics and complex motion. Like language, video data are auto-regressive by nature, so it is counter-intuitive to use an attention mechanism with bi-directional dependency in the model. Here we propose a Multi-Scale Causal (MSC) framework to address these problems. Specifically, we introduce multiple resolutions in the spatial dimension and high-low frequencies in the temporal dimension to realize efficient attention calculation. Furthermore, attention blocks on multiple scales are combined in a controlled way to allow causal conditioning on noisy image frames for diffusion training, based on the idea that noise destroys information at different rates at different resolutions. We theoretically show that our approach can greatly reduce the computational complexity and enhance the efficiency of training. The causal attention diffusion framework can also be used for auto-regressive long video generation without violating the natural order of frame sequences.
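As a rough illustration of the idea in the abstract, the sketch below combines a frame-level causal mask (tokens attend only to their own and earlier frames) with spatial average pooling to several resolutions. The dyadic pooling, single-head attention, and naive mean fusion across scales are all simplifying assumptions for illustration, not the authors' architecture.

```python
# Minimal sketch (not the authors' code): temporally causal attention over
# multi-scale video latents. h and w are assumed divisible by 2**(num_scales-1).
import torch
import torch.nn.functional as F

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask: True where attention is allowed."""
    frame_idx = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # token i may attend to token j iff frame(j) <= frame(i)
    return frame_idx[None, :] <= frame_idx[:, None]

def multiscale_causal_attention(x, num_frames, h, w, num_scales=2):
    """x: (batch, num_frames * h * w, dim), ordered frame-major."""
    b, _, d = x.shape
    outputs = []
    for s in range(num_scales):
        f = 2 ** s  # assumed dyadic downsampling factor for this scale
        spatial = x.view(b, num_frames, h, w, d).permute(0, 1, 4, 2, 3)
        spatial = spatial.reshape(b * num_frames, d, h, w)
        pooled = F.avg_pool2d(spatial, f) if f > 1 else spatial
        hs, ws = pooled.shape[-2:]
        tokens = pooled.reshape(b, num_frames, d, hs * ws).permute(0, 1, 3, 2)
        tokens = tokens.reshape(b, num_frames * hs * ws, d)
        mask = frame_causal_mask(num_frames, hs * ws).to(x.device)
        # single-head attention without projections, for brevity
        attn = (tokens @ tokens.transpose(-2, -1)) / d ** 0.5
        attn = attn.masked_fill(~mask, float("-inf")).softmax(dim=-1)
        out = attn @ tokens
        # upsample the coarse-scale output back to the full token count
        out = out.view(b, num_frames, hs, ws, d).permute(0, 1, 4, 2, 3)
        out = out.reshape(b * num_frames, d, hs, ws)
        out = F.interpolate(out, size=(h, w), mode="nearest")
        out = out.reshape(b, num_frames, d, h, w).permute(0, 1, 3, 4, 2)
        outputs.append(out.reshape(b, num_frames * h * w, d))
    return torch.stack(outputs).mean(0)  # naive fusion across scales
```

The causal mask is what permits autoregressive rollout at inference: appending a new frame's tokens never changes the attention outputs of earlier frames.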
Related papers
- FCA2: Frame Compression-Aware Autoencoder for Modular and Fast Compressed Video Super-Resolution [68.77813885751308]
State-of-the-art (SOTA) compressed video super-resolution (CVSR) models face persistent challenges, including prolonged inference time, complex training pipelines, and reliance on auxiliary information. We propose an efficient and scalable solution inspired by the structural and statistical similarities between hyperspectral images (HSI) and video data. Our approach introduces a compression-driven dimensionality reduction strategy that reduces computational complexity, accelerates inference, and enhances the extraction of temporal information across frames.
arXiv Detail & Related papers (2025-06-13T07:59:52Z) - Training-Free Efficient Video Generation via Dynamic Token Carving [54.52061549312799]
Jenga is an inference pipeline that combines dynamic attention carving with progressive resolution generation. As a plug-and-play solution, Jenga enables practical, high-quality video generation on modern hardware.
arXiv Detail & Related papers (2025-05-22T16:21:32Z) - Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions [94.21989689001848]
We propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks).
By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
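A toy sketch of the general recipe this abstract describes: a depthwise multi-kernel (pyramid) convolution stands in for self-attention, and a feature-matching loss distills the frozen attention module's outputs into it. `PyramidConvBlock` and `distill_step` are hypothetical stand-ins, not the paper's ΔConvBlock.

```python
# Toy sketch of distilling a self-attention module into pyramid convolutions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidConvBlock(nn.Module):
    """Hypothetical multi-kernel depthwise conv block replacing attention."""
    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
            for k in kernel_sizes
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):  # x: (b, dim, h, w)
        return self.proj(sum(branch(x) for branch in self.branches))

def distill_step(conv_block, frozen_attn, x, optimizer):
    """One feature-matching step; frozen_attn is assumed to map
    (b, dim, h, w) -> (b, dim, h, w) like the conv block."""
    with torch.no_grad():
        target = frozen_attn(x)  # teacher: original attention output
    loss = F.mse_loss(conv_block(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```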
arXiv Detail & Related papers (2025-04-30T03:57:28Z) - AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion [19.98565541640125]
We introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible video generation.
Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames.
This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence.
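The non-decreasing timestep constraint is easy to picture in code; the sampling scheme below (sort independently drawn timesteps) is one plausible realization, not necessarily the authors'.

```python
# Sketch of a non-decreasing per-frame corruption schedule: later frames
# always carry at least as much noise as earlier ones, so denoising can
# proceed frame-by-frame under causal attention.
import torch

def sample_nondecreasing_timesteps(num_frames, num_train_steps, batch_size):
    """Return (batch, num_frames) timesteps with t[f] <= t[f+1]."""
    t = torch.randint(0, num_train_steps, (batch_size, num_frames))
    t, _ = torch.sort(t, dim=1)  # enforce the non-decreasing constraint
    return t

# Example: frame 0 is the cleanest, the last frame the noisiest.
t = sample_nondecreasing_timesteps(num_frames=8, num_train_steps=1000, batch_size=2)
print(t)  # each row is sorted in non-decreasing order
```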
arXiv Detail & Related papers (2025-03-10T15:05:59Z) - ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires a full-sequence diffusion-based approach. We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer. We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
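One plausible reading of blockwise autoregressive diffusion, sketched with an assumed denoiser: each block is denoised from pure noise while attending only to the clean blocks generated before it.

```python
# Sketch of blockwise autoregressive generation with a conditional diffusion
# loop per block. `denoise` stands in for a trained conditional denoiser.
import torch

def generate_blockwise(denoise, num_blocks, block_shape, num_steps=50):
    prefix = []  # clean, previously generated blocks (the causal context)
    for _ in range(num_blocks):
        x = torch.randn(block_shape)          # start current block from noise
        context = torch.cat(prefix, dim=0) if prefix else None
        for step in reversed(range(num_steps)):
            x = denoise(x, step, context)     # condition only on the past
        prefix.append(x)                      # block becomes clean context
    return torch.cat(prefix, dim=0)
```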
arXiv Detail & Related papers (2024-12-10T18:13:20Z) - Multimodal Instruction Tuning with Hybrid State Space Models [25.921044010033267]
Long context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models.
We propose a novel approach using a hybrid transformer-MAMBA model to efficiently handle long contexts in multimodal applications.
Our model enhances inference efficiency for high-resolution images and high-frame-rate videos by about 4 times compared to current models.
arXiv Detail & Related papers (2024-11-13T18:19:51Z) - Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models [56.691967706131]
We view frames as continuous functions in 2D space, and videos as a sequence of continuous warping transformations between different frames.
This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems.
Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems.
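The warping view can be illustrated with a plain flow-field warp of a latent via `grid_sample`; where the flow comes from and how the paper builds temporally correlated warps are not modeled here.

```python
# Sketch: warp a latent (or image) with a dense flow field.
import torch
import torch.nn.functional as F

def warp(latent, flow):
    """latent: (b, c, h, w); flow: (b, 2, h, w) in pixel offsets (x, y)."""
    b, _, h, w = latent.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).expand(b, -1, -1, -1)  # (b, 2, h, w)
    coords = base + flow
    # normalize to [-1, 1] for grid_sample's sampling grid
    coords[:, 0] = coords[:, 0] / (w - 1) * 2 - 1
    coords[:, 1] = coords[:, 1] / (h - 1) * 2 - 1
    grid = coords.permute(0, 2, 3, 1)  # (b, h, w, 2)
    return F.grid_sample(latent, grid, align_corners=True)
```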
arXiv Detail & Related papers (2024-10-21T16:19:34Z) - Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models [6.408114351192012]
Video models require extensive training and computational resources, leading to high costs and large environmental impacts.
This paper introduces a novel approach to video generation by augmenting image diffusion models to create sequential animation frames while maintaining fine detail.
arXiv Detail & Related papers (2024-10-05T12:53:05Z) - Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation [36.098738197088124]
This work presents a Diffusion Reuse MOtion (Dr. Mo) network to accelerate latent video generation.
Coarse-grained noises in earlier denoising steps exhibit high motion consistency across consecutive video frames.
Dr. Mo propagates those coarse-grained noises onto the next frame by incorporating carefully designed, lightweight inter-frame motions.
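A minimal sketch of the reuse pattern the abstract describes: coarse denoising steps are computed once, warped forward by a motion module, and only the fine steps run per frame. `unet_step` and `motion_warp` are assumed stand-ins for the trained denoiser and motion network.

```python
# Sketch: reuse coarse denoising results across frames via motion warping.
import torch

def generate_frames(unet_step, motion_warp, first_frame_noise,
                    num_frames, num_steps, reuse_until):
    frames, x = [], first_frame_noise
    cached = None  # coarse-step latent to reuse on later frames
    for f in range(num_frames):
        if f == 0:
            for step in reversed(range(reuse_until, num_steps)):
                x = unet_step(x, step)       # coarse steps: computed once
            cached = x
        else:
            cached = motion_warp(cached, f)  # propagate coarse latent forward
            x = cached
        for step in reversed(range(reuse_until)):
            x = unet_step(x, step)           # fine steps: run per frame
        frames.append(x)
    return frames
```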
arXiv Detail & Related papers (2024-09-19T07:50:34Z) - DemMamba: Alignment-free Raw Video Demoireing with Frequency-assisted Spatio-Temporal Mamba [18.06907326360215]
Moiré patterns, resulting from the interference of two similar repetitive patterns, are frequently observed during the capture of images or videos on screens.
This paper introduces a novel alignment-free raw video demoireing network with frequency-assisted spatio-temporal Mamba.
Our proposed DemMamba surpasses state-of-the-art methods by 1.3 dB in PSNR, and also provides a satisfactory visual experience.
arXiv Detail & Related papers (2024-08-20T09:31:03Z) - Multi-Hierarchical Surrogate Learning for Structural Dynamical Crash Simulations Using Graph Convolutional Neural Networks [5.582881461692378]
We propose a multi-hierarchical framework for structurally creating a series of surrogate models for a kart frame.
For multiscale phenomena, macroscale features are captured on a coarse surrogate, whereas microscale effects are resolved by finer ones.
We train a graph-convolutional neural network-based surrogate that learns parameter-dependent low-dimensional latent dynamics on the coarsest representation.
arXiv Detail & Related papers (2024-02-14T15:22:59Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
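Hierarchical generation of long videos can be sketched as sparse keyframe sampling followed by interpolation between neighbouring keyframes; both samplers below are assumed stand-ins for the paper's latent diffusion models.

```python
# Sketch: two-level hierarchical generation (keyframes, then in-betweens).
import torch

def hierarchical_generate(sample_keyframes, interpolate, total_frames, stride):
    # coarse temporal level: one keyframe every `stride` frames
    keys = sample_keyframes(total_frames // stride + 1)
    video = []
    for a, b in zip(keys[:-1], keys[1:]):
        video.append(a)
        video.extend(interpolate(a, b, stride - 1))  # fill frames between keys
    video.append(keys[-1])
    # yields total_frames // stride * stride + 1 frames in order
    return torch.stack(video)
```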
arXiv Detail & Related papers (2022-11-23T18:58:39Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate information from only a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Autoencoding Video Latents for Adversarial Video Generation [0.0]
AVLAE is a two-stream latent autoencoder where the video distribution is learned by adversarial training.
We demonstrate that our approach learns to disentangle motion and appearance codes even without the explicit structural composition in the generator.
arXiv Detail & Related papers (2022-01-18T11:42:14Z) - Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video frame interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video frame interpolation.
In addition, we develop a multi-scale frame scheme to fully realize the potential of Transformers.
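Local attention of this kind can be sketched as non-overlapping window attention, where each token attends only within its spatial window; the fixed window size and single-head, projection-free form are simplifications for illustration.

```python
# Sketch: windowed (local) self-attention over a spatial feature map.
import torch

def windowed_attention(x, window=8):
    """x: (b, h, w, d) with h and w divisible by `window`."""
    b, h, w, d = x.shape
    # partition into non-overlapping (window x window) tiles
    t = x.view(b, h // window, window, w // window, window, d)
    t = t.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, d)
    attn = (t @ t.transpose(-2, -1)) / d ** 0.5   # attention inside each tile
    out = attn.softmax(dim=-1) @ t
    # merge tiles back into the full feature map
    out = out.view(b, h // window, w // window, window, window, d)
    out = out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, d)
    return out
```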
arXiv Detail & Related papers (2021-11-27T05:35:10Z)