Pushing the Boundaries of State Space Models for Image and Video Generation
- URL: http://arxiv.org/abs/2502.00972v1
- Date: Mon, 03 Feb 2025 00:51:09 GMT
- Title: Pushing the Boundaries of State Space Models for Image and Video Generation
- Authors: Yicong Hong, Long Mai, Yuan Yao, Feng Liu
- Abstract summary: We build the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters) based on the sub-quadratic bi-directional Hydra and self-attention.
Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics.
- Abstract: While Transformers have become the dominant architecture for visual generation, linear attention models, such as state-space models (SSMs), are increasingly recognized for their efficiency in processing long visual sequences. However, the essential efficiency of these models comes from formulating a limited recurrent state, enforcing causality among tokens; this is prone to inconsistent modeling of N-dimensional visual data and leaves open questions about their capacity to generate long non-causal sequences. In this paper, we explore the boundary of SSMs on image and video generation by building the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters), based on the sub-quadratic bi-directional Hydra and self-attention, and generate images up to 2K resolution and 8-second 360p videos (16 FPS). Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics, suggesting the strong potential of SSMs for visual generation tasks.
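The core design pairs a sub-quadratic bidirectional sequence mixer with ordinary self-attention inside each block. Below is a minimal PyTorch sketch of such a hybrid block; the bidirectional layer is a toy gated linear recurrence standing in for Hydra (which actually uses quasiseparable matrix mixers), and every module name and size here is an illustrative assumption, not the paper's implementation.

```python
# Illustrative sketch only -- not the paper's implementation.
# One hybrid block: a bidirectional linear recurrence (a toy stand-in
# for Hydra) followed by standard self-attention, each behind a residual.
import torch
import torch.nn as nn

class ToyBiSSM(nn.Module):
    """Gated linear recurrence run left-to-right and right-to-left."""
    def __init__(self, dim):
        super().__init__()
        self.in_proj = nn.Linear(dim, 2 * dim)
        self.decay = nn.Parameter(torch.full((dim,), 2.0))  # per-channel decay logit
        self.out_proj = nn.Linear(2 * dim, dim)

    def scan(self, u, reverse=False):
        # h_t = a * h_{t-1} + (1 - a) * u_t   -- O(L) per direction
        a = torch.sigmoid(self.decay)
        steps = reversed(range(u.size(1))) if reverse else range(u.size(1))
        h, states = torch.zeros_like(u[:, 0]), []
        for t in steps:
            h = a * h + (1 - a) * u[:, t]
            states.append(h)
        if reverse:
            states.reverse()
        return torch.stack(states, dim=1)

    def forward(self, x):
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        both = torch.cat([self.scan(u), self.scan(u, reverse=True)], dim=-1)
        return self.out_proj(both) * torch.sigmoid(gate)

class HybridBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ssm = ToyBiSSM(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                         # x: (batch, tokens, dim)
        x = x + self.ssm(self.norm1(x))           # sub-quadratic global mixing
        h = self.norm2(x)
        return x + self.attn(h, h, h, need_weights=False)[0]

tokens = torch.randn(2, 64, 128)                  # 64 flattened visual tokens
print(HybridBlock(128)(tokens).shape)             # torch.Size([2, 64, 128])
```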
Related papers
- CascadeV: An Implementation of Wurstchen Architecture for Video Generation
We propose a cascaded latent diffusion model (LDM) that is capable of producing state-of-the-art 2K resolution videos.
Experiments demonstrate that our cascaded model achieves a higher compression ratio, substantially reducing the computational challenges associated with high-quality video generation.
Our model can be cascaded with existing T2V models, theoretically enabling a 4× increase in resolution or frames per second without any fine-tuning.
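As a rough sketch of the cascading idea (interfaces and modules here are invented for illustration, not CascadeV's implementation), an upsampler stage can take a base stage's low-resolution output and predict only the residual detail at 4× the spatial size:

```python
# Toy cascade sketch: a base stage's low-resolution video is refined by
# an upsampler stage conditioned on it. Shapes and modules are
# illustrative assumptions, not CascadeV's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplerStage(nn.Module):
    """Refines a trilinearly upsampled base video with a small conv net."""
    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.scale = scale
        self.refine = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(32, channels, kernel_size=3, padding=1),
        )

    def forward(self, base_video):               # (B, C, T, H, W)
        up = F.interpolate(base_video, scale_factor=(1, self.scale, self.scale),
                           mode="trilinear", align_corners=False)
        return up + self.refine(up)              # predict residual detail only

low_res = torch.randn(1, 3, 8, 64, 64)           # output of some base T2V stage
print(UpsamplerStage()(low_res).shape)           # torch.Size([1, 3, 8, 256, 256])
```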
arXiv Detail & Related papers (2025-01-28T01:14:24Z)
- Four-Plane Factorized Video Autoencoders
We propose an autoencoder that projects data onto a four-plane factorized latent space that grows sublinearly with the input size.
Our results show that the proposed four-plane latent space retains a rich representation needed for high-fidelity reconstructions.
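A minimal sketch of the factorization idea: pool a video feature volume onto four 2D planes, so storage grows with T·H + T·W + 2·H·W rather than T·H·W. The plane definitions below (appearance, motion, and two space-time planes) are assumptions for illustration, not the paper's construction.

```python
# Illustrative four-plane factorization of video features. The plane
# choices here are assumptions for the example, not the paper's design.
import torch

def four_plane_project(x):
    """x: (B, C, T, H, W) -> four 2D feature planes."""
    appearance = x.mean(dim=2)                               # (B, C, H, W): pool time
    motion = (x[:, :, 1:] - x[:, :, :-1]).abs().mean(dim=2)  # (B, C, H, W)
    th_plane = x.mean(dim=4)                                 # (B, C, T, H): pool width
    tw_plane = x.mean(dim=3)                                 # (B, C, T, W): pool height
    return appearance, motion, th_plane, tw_plane

video = torch.randn(1, 16, 32, 64, 64)           # 32 frames of 64x64 features
for plane in four_plane_project(video):
    print(plane.shape)
# Total plane elements grow sublinearly in T compared with the full
# (T*H*W) volume, which is the point of the factorization.
```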
arXiv Detail & Related papers (2024-12-05T18:58:17Z)
- ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.
A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.
An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
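The injection can be pictured as AdaLN-style modulation: embed the AR model's coarse tokens into a per-channel scale and shift applied to the normalized DiT features. A sketch under that assumption (module and tensor names are invented):

```python
# Sketch of adaptive norm-based injection: coarse tokens from an AR
# model modulate DiT features via learned scale/shift (AdaLN-style).
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AdaNormInjection(nn.Module):
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, coarse_tokens):
        # x: (B, N, dim) DiT features; coarse_tokens: (B, M, cond_dim)
        cond = coarse_tokens.mean(dim=1)          # pool AR tokens to one vector
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

inj = AdaNormInjection(dim=128, cond_dim=64)
out = inj(torch.randn(2, 256, 128), torch.randn(2, 32, 64))
print(out.shape)                                  # torch.Size([2, 256, 128])
```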
arXiv Detail & Related papers (2024-10-27T16:28:28Z)
- Efficient Visual State Space Model for Image Deblurring
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
We propose a simple yet effective visual state space model (EVSSM) for image deblurring.
arXiv Detail & Related papers (2024-05-23T09:13:36Z)
- S2DM: Sector-Shaped Diffusion Models for Video Generation
We propose a novel Sector-Shaped Diffusion Model (S2DM) for video generation.
S2DM can generate a group of intrinsically related samples that share the same semantic features.
We show that, without additional training, our model integrated with a separate temporal-condition generative model still achieves performance comparable to existing works.
arXiv Detail & Related papers (2024-03-20T08:50:15Z)
- SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces
Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features, but the memory cost of attention grows steeply with sequence length.
This limitation presents significant challenges when generating longer video sequences using diffusion models.
We propose leveraging state-space models (SSMs) as temporal feature extractors.
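A toy version of the substitution: replace temporal attention with a diagonal linear recurrence over the frame axis, which runs in O(T) rather than O(T²). The parameterization below is a deliberate simplification of structured SSMs such as S4, not the paper's exact layer.

```python
# Toy temporal SSM: a diagonal linear recurrence over frames, used in
# place of temporal attention. A simplification for illustration only.
import torch
import torch.nn as nn

class TemporalSSM(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(dim))   # per-channel decay logit
        self.in_mix = nn.Linear(dim, dim)
        self.out_mix = nn.Linear(dim, dim)

    def forward(self, x):                # x: (B, T, N, dim), T frames, N tokens
        a = torch.sigmoid(self.decay)
        u = self.in_mix(x)
        h, states = torch.zeros_like(u[:, 0]), []
        for t in range(u.size(1)):       # O(T) scan instead of O(T^2) attention
            h = a * h + u[:, t]
            states.append(h)
        return x + self.out_mix(torch.stack(states, dim=1))

frames = torch.randn(1, 24, 64, 32)      # 24 frames, 64 spatial tokens each
print(TemporalSSM(32)(frames).shape)     # torch.Size([1, 24, 64, 32])
```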
arXiv Detail & Related papers (2024-03-12T14:53:56Z)
- Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
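The LDM recipe underneath is compact: encode to a low-dimensional latent, add noise at a random timestep, and train a denoiser to predict that noise. A generic sketch follows, with placeholder encoder/denoiser callables rather than this paper's networks.

```python
# Generic latent-diffusion training step (epsilon-prediction). The
# encoder and denoiser are placeholders; only the recipe is the point.
import torch
import torch.nn.functional as F

def ldm_training_step(encoder, denoiser, x, alphas_cumprod):
    z = encoder(x)                                   # compact latent, cheap to denoise
    t = torch.randint(0, len(alphas_cumprod), (z.size(0),), device=z.device)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (z.dim() - 1)))
    noise = torch.randn_like(z)
    z_noisy = a_bar.sqrt() * z + (1 - a_bar).sqrt() * noise   # forward process
    return F.mse_loss(denoiser(z_noisy, t), noise)   # predict the added noise

# Tiny smoke test with an identity "encoder" and a trivial "denoiser".
alphas = torch.linspace(0.999, 0.01, 1000).cumprod(dim=0)
loss = ldm_training_step(lambda x: x, lambda z, t: torch.zeros_like(z),
                         torch.randn(4, 8, 16, 16), alphas)
print(loss.item())
```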
arXiv Detail & Related papers (2023-04-18T08:30:32Z)
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
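The shift operation itself is parameter-free: move one slice of channels one frame forward and another slice one frame backward, so purely spatial layers see neighboring-frame features. A sketch of that operation (the 1/4 channel split is an illustrative assumption):

```python
# Temporal shift: move 1/4 of channels one frame forward, 1/4 one frame
# backward, zero-padding the ends. Parameter-free temporal mixing; the
# split ratio is an illustrative assumption.
import torch

def temporal_shift(x):
    """x: (B, T, C, H, W) -> same shape, channels partially shifted in time."""
    b, t, c, h, w = x.shape
    fold = c // 4
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # rest untouched
    return out

lat = torch.randn(1, 8, 16, 32, 32)                       # 8 latent frames
print(temporal_shift(lat).shape)                          # torch.Size([1, 8, 16, 32, 32])
```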
arXiv Detail & Related papers (2023-04-17T17:57:06Z)
- Video Probabilistic Diffusion Models in Projected Latent Space
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
- MVFNet: Multi-View Fusion Network for Efficient Video Recognition
We introduce a multi-view fusion (MVF) module to exploit video dynamics using separable convolutions for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
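The efficiency argument rests on separable convolutions: cheap depthwise 1D kernels along the temporal, height, and width axes instead of a full 3D kernel. The module below is an illustrative reading of that idea, not MVFNet's exact MVF block.

```python
# Sketch of multi-view fusion with separable convolutions: depthwise 1D
# convs along the T, H, and W views, fused back into the base features.
# An illustrative reading of MVF, not the paper's exact module.
import torch
import torch.nn as nn

class SeparableMVF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.t_conv = nn.Conv3d(channels, channels, (3, 1, 1), padding=(1, 0, 0),
                                groups=channels)     # temporal view
        self.h_conv = nn.Conv3d(channels, channels, (1, 3, 1), padding=(0, 1, 0),
                                groups=channels)     # height view
        self.w_conv = nn.Conv3d(channels, channels, (1, 1, 3), padding=(0, 0, 1),
                                groups=channels)     # width view

    def forward(self, x):                            # x: (B, C, T, H, W)
        return x + self.t_conv(x) + self.h_conv(x) + self.w_conv(x)

clip = torch.randn(2, 16, 8, 28, 28)
print(SeparableMVF(16)(clip).shape)                  # torch.Size([2, 16, 8, 28, 28])
```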
arXiv Detail & Related papers (2020-12-13T06:34:18Z)