SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces
- URL: http://arxiv.org/abs/2403.07711v4
- Date: Tue, 3 Sep 2024 09:24:20 GMT
- Title: SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces
- Authors: Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, Yutaka Matsuo
- Abstract summary: Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features.
Because the computational cost of attention layers grows quadratically with sequence length, generating longer video sequences with diffusion models is challenging.
We propose leveraging state-space models (SSMs) as temporal feature extractors.
- Score: 20.23192934634197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the remarkable achievements in image generation through diffusion models, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features. However, attention layers are limited by their computational costs, which increase quadratically with the sequence length. This limitation presents significant challenges when generating longer video sequences using diffusion models. To overcome this challenge, we propose leveraging state-space models (SSMs) as temporal feature extractors. SSMs (e.g., Mamba) have recently gained attention as promising alternatives due to their linear-time memory consumption relative to sequence length. In line with previous research suggesting that using bidirectional SSMs is effective for understanding spatial features in image generation, we found that bidirectionality is also beneficial for capturing temporal features in video data, rather than relying on traditional unidirectional SSMs. We conducted comprehensive evaluations on multiple long-term video datasets, such as MineRL Navigate, across various model sizes. For sequences up to 256 frames, SSM-based models require less memory to achieve the same FVD as attention-based models. Moreover, SSM-based models often deliver better performance with comparable GPU memory usage. Our codes are available at https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
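The bidirectional temporal SSM idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see their repository for that); it assumes a diagonal, time-invariant SSM applied per channel along the frame axis, and the names `ssm_scan`, `bidirectional_ssm`, `a`, `b`, `c` are hypothetical. Summing the forward and backward outputs is also an assumption made for simplicity.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear SSM along the first (time) axis.

    x: (T, D) array of per-frame features.
    a, b, c: (D,) per-channel state decay, input gain, and output gain
             (hypothetical parameters for illustration).
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    The loop keeps only one state vector of size D, so the scan's working
    state does not grow with the number of frames T.
    """
    h = np.zeros(x.shape[1])
    y = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]
        y[t] = c * h
    return y

def bidirectional_ssm(x, fwd, bwd):
    """Combine a forward scan with a scan over the time-reversed sequence.

    fwd, bwd: (a, b, c) parameter tuples for each direction.
    Assumption: the two directions are merged by simple addition.
    """
    y_fwd = ssm_scan(x, *fwd)
    y_bwd = ssm_scan(x[::-1], *bwd)[::-1]  # scan backward, then restore order
    return y_fwd + y_bwd

# Usage: an impulse in the last frame of a 4-frame, 2-channel sequence.
T, D = 4, 2
x = np.zeros((T, D))
x[-1] = 1.0
params = (np.full(D, 0.5), np.ones(D), np.ones(D))
y_uni = ssm_scan(x, *params)
y_bi = bidirectional_ssm(x, params, params)
```

In this toy setting the unidirectional output at the first frame is zero because the forward scan has not yet seen the impulse, while the bidirectional output at the first frame is nonzero: every frame receives information from both earlier and later frames.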
Related papers
- MS-SSM: A Multi-Scale State Space Model for Efficient Sequence Modeling [60.648359990090846]
State-space models (SSMs) have recently gained attention as an efficient alternative to computationally expensive attention-based models for sequence modeling. This paper introduces a multi-scale SSM framework that represents sequence dynamics across multiple resolutions and processes each resolution with specialized state-space dynamics.
arXiv Detail & Related papers (2025-12-29T19:36:28Z) - DiM-TS: Bridge the Gap between Selective State Space Models and Time Series for Generative Modeling [11.836475971106125]
Time series data plays a pivotal role in a wide variety of fields but faces challenges related to privacy concerns. We propose Lag Fusion Mamba and Permutation Scanning Mamba, which enhance the model's ability to discern significant patterns during the denoising process. We also introduce Diffusion Mamba for Time Series (DiM-TS), a high-quality time series generation model that better preserves the temporal periodicity and inter-channel correlations.
arXiv Detail & Related papers (2025-11-23T06:48:03Z) - Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention [40.10862285690496]
We propose a novel Recurrent Autoregressive Diffusion (RAD) framework, which executes frame-wise autoregression for memory update and retrieval. Experiments on Memory and Minecraft datasets demonstrate the superiority of RAD for long video generation.
arXiv Detail & Related papers (2025-11-17T03:47:12Z) - Uniform Discrete Diffusion with Metric Path for Video Generation [103.86033350602908]
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-duration inconsistency. We present URSA (Uniform Discrete Diffusion with Metric Path), a powerful framework that bridges the gap with continuous approaches for scalable video generation. URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.
arXiv Detail & Related papers (2025-10-28T17:59:57Z) - Long-Context State-Space Video World Models [66.28743632951218]
We propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory.
arXiv Detail & Related papers (2025-05-26T16:12:41Z) - BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [46.199493246921435]
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information.
We introduce BIMBA, an efficient state-space model to handle long-form videos.
arXiv Detail & Related papers (2025-03-12T17:57:32Z) - Pushing the Boundaries of State Space Models for Image and Video Generation [26.358592737557956]
We build the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters) based on the sub-quadratic bi-directional Hydra and self-attention.
Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics.
arXiv Detail & Related papers (2025-02-03T00:51:09Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - Treating Brain-inspired Memories as Priors for Diffusion Model to Forecast Multivariate Time Series [16.315066774520524]
We get inspiration from humans' memory mechanisms to better capture temporal patterns.
Brain-inspired memory comprises semantic and episodic memory.
We present a brain-inspired memory-augmented diffusion model.
arXiv Detail & Related papers (2024-09-27T07:09:40Z) - DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs [59.434893231950205]
Dynamic graph learning aims to uncover evolutionary laws in real-world systems.
We propose DyG-Mamba, a new continuous state space model for dynamic graph learning.
We show that DyG-Mamba achieves state-of-the-art performance on most datasets.
arXiv Detail & Related papers (2024-08-13T15:21:46Z) - LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The computational cost of the self-attention mechanism limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps compensate for the lack of long-range dependency modeling.
arXiv Detail & Related papers (2024-04-17T08:26:34Z) - MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection [5.37935922811333]
MambaMixer is a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels.
As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block.
arXiv Detail & Related papers (2024-03-29T00:05:13Z) - S2DM: Sector-Shaped Diffusion Models for Video Generation [2.0270353391739637]
We propose a novel Sector-Shaped Diffusion Model (S2DM) for video generation.
S2DM can generate a group of intrinsically related data sharing the same semantic and intrinsically related features.
We show that, without additional training, our model integrated with another temporally conditioned generative model can still achieve performance comparable to existing works.
arXiv Detail & Related papers (2024-03-20T08:50:15Z) - Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands.
Here, we apply the LDM paradigm to high-resolution generation, a particularly resource-intensive task.
We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z) - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z) - Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z) - Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.