Related papers: SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces

URL: http://arxiv.org/abs/2403.07711v4
Date: Tue, 3 Sep 2024 09:24:20 GMT
Title: SSM Meets Video Diffusion Models: Efficient Long-Term Video Generation with Structured State Spaces
Authors: Yuta Oshima, Shohei Taniguchi, Masahiro Suzuki, Yutaka Matsuo
Abstract summary: Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features. However, the computational cost of attention layers increases quadratically with the sequence length, which presents significant challenges when generating longer video sequences. We propose leveraging state-space models (SSMs) as temporal feature extractors.
Score: 20.23192934634197
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Given the remarkable achievements in image generation through diffusion models, the research community has shown increasing interest in extending these models to video generation. Recent diffusion models for video generation have predominantly utilized attention layers to extract temporal features. However, attention layers are limited by their computational costs, which increase quadratically with the sequence length. This limitation presents significant challenges when generating longer video sequences using diffusion models. To overcome this challenge, we propose leveraging state-space models (SSMs) as temporal feature extractors. SSMs (e.g., Mamba) have recently gained attention as promising alternatives because their memory consumption scales linearly with sequence length. In line with previous research suggesting that bidirectional SSMs are effective for understanding spatial features in image generation, we found that bidirectionality is also beneficial for capturing temporal features in video data, compared with traditional unidirectional SSMs. We conducted comprehensive evaluations on multiple long-term video datasets, such as MineRL Navigate, across various model sizes. For sequences up to 256 frames, SSM-based models require less memory to achieve the same FVD as attention-based models. Moreover, SSM-based models often deliver better performance with comparable GPU memory usage. Our code is available at https://github.com/shim0114/SSM-Meets-Video-Diffusion-Models.
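
The idea of a bidirectional SSM over the temporal axis can be illustrated with a toy example. The sketch below is not the authors' implementation (their code is at the repository linked above). It assumes a simple diagonal, time-invariant linear recurrence rather than a selective SSM such as Mamba, and the names `ssm_scan` and `bidirectional_ssm` and the parameters `a`, `b`, `c` are hypothetical. It shows two points made in the abstract: the recurrence carries only a fixed-size state per channel, so the state does not grow with the number of frames, and running a second scan in reverse lets every frame see both past and future frames.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Run a toy diagonal linear SSM along the time axis.

    x: array of shape (T, D), one D-dimensional feature per frame.
    a, b, c: arrays of shape (D,), per-channel parameters (hypothetical).
    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    """
    T, D = x.shape
    h = np.zeros(D)                    # state size is independent of T
    y = np.empty_like(x, dtype=float)
    for t in range(T):
        h = a * h + b * x[t]
        y[t] = c * h
    return y


def bidirectional_ssm(x, fwd, bwd):
    """Sum a forward scan and a time-reversed scan (one way to combine them)."""
    y_fwd = ssm_scan(x, *fwd)
    # Reverse the frames, scan, then restore the original frame order.
    y_bwd = ssm_scan(x[::-1], *bwd)[::-1]
    return y_fwd + y_bwd


# Usage: 4 frames, 1 channel, identical parameters in both directions.
frames = np.ones((4, 1))
params = (np.array([0.5]), np.array([1.0]), np.array([1.0]))
out = bidirectional_ssm(frames, params, params)
print(out.shape)
```

In the forward-only scan the output at the first frame depends only on that frame, whereas in the bidirectional version it also depends on all later frames. Summing the two directions is just one possible way to merge them; other choices such as concatenation are equally plausible and the source does not specify which is used.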

Related papers

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering [46.199493246921435]
Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information. We introduce BIMBA, an efficient state-space model to handle long-form videos.
arXiv Detail & Related papers (2025-03-12T17:57:32Z)
Pushing the Boundaries of State Space Models for Image and Video Generation [26.358592737557956]
We build the largest-scale diffusion SSM-Transformer hybrid model to date (5B parameters) based on the sub-quadratic bi-directional Hydra and self-attention. Our results demonstrate that the model can produce faithful results aligned with complex text prompts and temporally consistent videos with high dynamics.
arXiv Detail & Related papers (2025-02-03T00:51:09Z)
Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
Treating Brain-inspired Memories as Priors for Diffusion Model to Forecast Multivariate Time Series [16.315066774520524]
We get inspiration from humans' memory mechanisms to better capture temporal patterns. Brain-inspired memory comprises semantic and episodic memory. We present a brain-inspired memory-augmented diffusion model.
arXiv Detail & Related papers (2024-09-27T07:09:40Z)
DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs [59.434893231950205]
Dynamic graph learning aims to uncover evolutionary laws in real-world systems. We propose DyG-Mamba, a new continuous state space model for dynamic graph learning. We show that DyG-Mamba achieves state-of-the-art performance on most datasets.
arXiv Detail & Related papers (2024-08-13T15:21:46Z)
LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences. We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook. LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection [5.37935922811333]
MambaMixer is a new architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels. As a proof of concept, we design Vision MambaMixer (ViM2) and Time Series MambaMixer (TSM2) architectures based on the MambaMixer block.
arXiv Detail & Related papers (2024-03-29T00:05:13Z)
S2DM: Sector-Shaped Diffusion Models for Video Generation [2.0270353391739637]
We propose a novel Sector-Shaped Diffusion Model (S2DM) for video generation. S2DM can generate a group of intrinsically related data sharing the same semantic and intrinsically related features. We show that, without additional training, our model integrated with another temporal conditions generative model can still achieve comparable performance with existing works.
arXiv Detail & Related papers (2024-03-20T08:50:15Z)
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [71.11425812806431]
Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands. Here, we apply the LDM paradigm to high-resolution generation, a particularly resource-intensive task. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling.
arXiv Detail & Related papers (2023-04-18T08:30:32Z)
Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model. We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
arXiv Detail & Related papers (2023-04-17T17:57:06Z)
Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM). PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space. We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.