StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
- URL: http://arxiv.org/abs/2505.22246v2
- Date: Thu, 26 Jun 2025 12:10:36 GMT
- Title: StateSpaceDiffuser: Bringing Long Context to Diffusion World Models
- Authors: Nedko Savov, Naser Kazemi, Deheng Zhang, Danda Pani Paudel, Xi Wang, Luc Van Gool
- Abstract summary: We introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. Experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline.
- Score: 53.05314852577144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: World models have recently become promising tools for predicting realistic visuals based on actions in complex environments. However, their reliance on only a few recent observations leads them to lose track of the long-term context. Consequently, in just a few steps the generated scenes drift from what was previously observed, undermining the temporal coherence of the sequence. This limitation of state-of-the-art world models, most of which rely on diffusion, comes from their lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model that represents the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to reinstantiate seen content in extended rollouts. Comprehensive experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline, maintaining a coherent visual context for an order of magnitude more steps. It delivers consistent views in both a 2D maze-navigation task and a complex 3D environment. These results establish that bringing state-space representations into diffusion models is highly effective at preserving both visual detail and long-term memory.
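To make the integration concrete, here is a minimal PyTorch sketch of the general recipe the abstract describes: a state-space scan compresses the full (observation, action) history into a fixed-size feature, and the diffusion denoiser is conditioned on that feature. Every module name, shape, and the crude noise schedule below are invented for illustration; this is not the paper's architecture.

```python
import torch
import torch.nn as nn

class HistorySSM(nn.Module):
    """Toy diagonal state-space scan that folds the whole (obs, action)
    history into a fixed-size state. A hypothetical stand-in for the
    paper's state-space backbone."""
    def __init__(self, in_dim, state_dim):
        super().__init__()
        self.in_proj = nn.Linear(in_dim, state_dim)
        self.log_decay = nn.Parameter(torch.zeros(state_dim))  # per-channel decay

    def forward(self, seq):                    # seq: (B, T, in_dim)
        decay = torch.sigmoid(self.log_decay)  # keeps the recurrence stable
        u = self.in_proj(seq)
        h = torch.zeros(seq.size(0), u.size(-1), device=seq.device)
        states = []
        for t in range(seq.size(1)):           # h_t = a * h_{t-1} + u_t
            h = decay * h + u[:, t]
            states.append(h)
        return torch.stack(states, dim=1)      # (B, T, state_dim)

class ConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts noise from a noisy frame latent, a timestep,
    and the SSM history feature (the long-term memory)."""
    def __init__(self, latent_dim, state_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + state_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim))

    def forward(self, noisy, t, history):
        t = t.float().unsqueeze(-1) / 1000.0   # scalar timestep embedding
        return self.net(torch.cat([noisy, history, t], dim=-1))

# One training step on random data, just to show the wiring.
B, T, obs_act_dim, latent_dim, state_dim = 4, 32, 64, 128, 96
ssm = HistorySSM(obs_act_dim, state_dim)
denoiser = ConditionedDenoiser(latent_dim, state_dim)
history = ssm(torch.randn(B, T, obs_act_dim))[:, -1]   # last state summarizes all T steps
x0, noise = torch.randn(B, latent_dim), torch.randn(B, latent_dim)
t = torch.randint(0, 1000, (B,))
alpha = (1 - t.float() / 1000).unsqueeze(-1)           # crude noise schedule
noisy = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
loss = ((denoiser(noisy, t, history) - noise) ** 2).mean()
loss.backward()
```

Because the history is folded into a fixed-size state, the conditioning cost stays constant however long the episode grows, which is the property that lets memory outlast the denoiser's own observation window.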
Related papers
- Holistic Surgical Phase Recognition with Hierarchical Input Dependent State Space Models [56.2236083600999]
We propose a novel hierarchical input-dependent state space model for surgical video analysis. Our framework incorporates a temporally consistent visual feature extractor, appending a state-space-model head that propagates temporal information across frames. Experiments show that our method outperforms the current state-of-the-art methods by a large margin.
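A toy sketch of what an "input-dependent" state-space head appended to a per-frame feature extractor can look like: the per-step decay is predicted from the input itself rather than fixed. The cell below is illustrative, not the paper's design.

```python
import torch
import torch.nn as nn

class InputDependentSSMHead(nn.Module):
    """Toy recurrence whose decay gate depends on the current input,
    propagating temporal information across per-frame features."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)   # input-dependent decay
        self.inp = nn.Linear(dim, dim)

    def forward(self, feats):             # feats: (B, T, dim) from any backbone
        h = torch.zeros(feats.size(0), feats.size(-1), device=feats.device)
        out = []
        for t in range(feats.size(1)):
            a = torch.sigmoid(self.gate(feats[:, t]))
            h = a * h + (1 - a) * self.inp(feats[:, t])  # gated update
            out.append(h)
        return torch.stack(out, dim=1)

frames = torch.randn(2, 16, 512)          # (batch, time, feature) per-frame features
logits = nn.Linear(512, 7)(InputDependentSSMHead(512)(frames))  # e.g. 7 phase classes
print(logits.shape)                       # torch.Size([2, 16, 7])
```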
arXiv Detail & Related papers (2025-06-26T14:43:57Z)
- DeepVerse: 4D Autoregressive Video Generation as a World Model [16.877309608945566]
We introduce DeepVerse, a novel 4D interactive world model that explicitly incorporates geometric predictions from previous timesteps into current predictions conditioned on actions. Experiments demonstrate that by incorporating explicit geometric constraints, DeepVerse captures richer spatio-temporal relationships and underlying physical dynamics. This capability significantly reduces drift and enhances temporal consistency, enabling the model to reliably generate extended future sequences.
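The feedback loop the summary describes can be pictured as a one-step predictor whose output geometry is fed back in at the next step. A toy version with made-up shapes and names:

```python
import torch
import torch.nn as nn

class GeometryAwareStep(nn.Module):
    """Toy one-step world model: predicts the next frame and depth from the
    current frame, an action, and the *previous* depth estimate."""
    def __init__(self, frame_dim=64, depth_dim=16, act_dim=4):
        super().__init__()
        self.core = nn.Linear(frame_dim + depth_dim + act_dim, 128)
        self.to_frame = nn.Linear(128, frame_dim)
        self.to_depth = nn.Linear(128, depth_dim)

    def forward(self, frame, depth, action):
        h = torch.relu(self.core(torch.cat([frame, depth, action], dim=-1)))
        return self.to_frame(h), self.to_depth(h)

step = GeometryAwareStep()
frame, depth = torch.randn(1, 64), torch.zeros(1, 16)
for _ in range(8):                         # autoregressive rollout
    frame, depth = step(frame, depth, torch.randn(1, 4))  # geometry fed back
```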
arXiv Detail & Related papers (2025-06-01T17:58:36Z)
- FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes the noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
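A plausible minimal version of the band-wise scaling, assuming it is applied to the classifier-free-guidance noise difference; the cutoff and scale values are illustrative, not the paper's settings:

```python
import torch

def frequency_scale(delta, low_scale=1.0, high_scale=1.2, cutoff=0.25):
    """Split a noise-prediction difference into low- and high-frequency
    bands with an FFT mask and rescale each band independently."""
    f = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
    h, w = delta.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=delta.device),
        torch.linspace(-1, 1, w, device=delta.device), indexing="ij")
    low = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).float()   # centered low-pass mask
    f = f * (low * low_scale + (1 - low) * high_scale)
    return torch.fft.ifft2(torch.fft.ifftshift(f, dim=(-2, -1))).real

# e.g. rescaling the guidance delta at one sampling step:
eps_cond, eps_uncond = torch.randn(1, 4, 32, 32), torch.randn(1, 4, 32, 32)
eps = eps_uncond + 7.5 * frequency_scale(eps_cond - eps_uncond)
```

No weights are touched, which is why this kind of control is retraining-free and model-agnostic.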
arXiv Detail & Related papers (2025-04-02T22:03:11Z)
- Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models [71.63194926457119]
We introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Experiments across scientific spatiotemporal forecasting, video prediction, and time-series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks.
arXiv Detail & Related papers (2025-03-02T16:10:32Z)
- CoDiff: Conditional Diffusion Model for Collaborative 3D Object Detection [9.28605575548509]
Collaborative 3D object detection holds significant importance in the field of autonomous driving. Due to pose estimation errors and time delays, the fusion of information across agents often results in feature representations with spatial and temporal noise. We propose CoDiff, a novel robust collaborative perception framework.
arXiv Detail & Related papers (2025-02-17T03:20:52Z)
- EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling [8.250616459360684]
We introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models. Our approach outperforms existing baselines across visually challenging Atari 100k tasks, a memory-demanding benchmark, and 3D first-person ViZDoom environments.
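Complementing the training-time sketch after the main abstract, here is a toy inference-time rollout for this family of models: a recurrent state is updated after every generated frame, so memory persists across the episode. The GRU cell is a stand-in for the state-space update; none of this is EDELINE's actual module.

```python
import torch
import torch.nn as nn

state_dim, frame_dim = 64, 32
cell = nn.GRUCell(frame_dim, state_dim)                # stand-in for the SSM update
denoise = nn.Linear(state_dim + frame_dim, frame_dim)  # stand-in denoiser

state = torch.zeros(1, state_dim)
with torch.no_grad():
    for step in range(100):                    # memory survives long rollouts
        x = torch.randn(1, frame_dim)          # start each frame from noise
        for _ in range(4):                     # a few crude denoising steps
            x = x - 0.25 * denoise(torch.cat([state, x], dim=-1))
        state = cell(x, state)                 # fold the new frame into memory
```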
arXiv Detail & Related papers (2025-02-01T15:49:59Z)
- Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that fundamental DiT and SiT trained on ReaLS achieve a 15% improvement in the FID metric. The enhanced semantic latent space also enables perception-oriented downstream tasks such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
- Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models [65.82564074712836]
We introduce DIFfusionHOI, a new HOI detector shedding light on text-to-image diffusion models.
We first devise an inversion-based strategy to learn the expression of relation patterns between humans and objects in embedding space.
These learned relation embeddings then serve as textual prompts to steer diffusion models to generate images that depict specific interactions.
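An inversion-based strategy of this kind typically resembles textual inversion: a single embedding is optimized against a frozen model's denoising loss on images that show the target interaction. A toy loop in which every module is a stand-in:

```python
import torch
import torch.nn as nn

embed_dim, latent_dim = 768, 16
relation = nn.Parameter(torch.randn(1, embed_dim) * 0.02)        # learnable relation token
frozen_denoiser = nn.Linear(embed_dim + latent_dim, latent_dim)  # stands in for a U-Net
for p in frozen_denoiser.parameters():
    p.requires_grad_(False)                    # only the embedding is trained

opt = torch.optim.Adam([relation], lr=1e-3)
for _ in range(100):
    latents = torch.randn(8, latent_dim)       # latents of interaction images
    noise = torch.randn_like(latents)
    noisy = latents + noise                    # crude forward process
    cond = relation.expand(8, -1)              # relation token as the "prompt"
    pred = frozen_denoiser(torch.cat([cond, noisy], dim=-1))
    loss = ((pred - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

After optimization, the learned token can be dropped into prompts to make the generator depict that specific interaction.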
arXiv Detail & Related papers (2024-10-26T12:00:33Z)
- Discrete Diffusion Language Model for Efficient Text Summarization [19.267738861590487]
We introduce a novel semantic-aware noising process that enables Transformer backbones to handle long sequences effectively. Our approaches achieve state-of-the-art performance on three benchmark summarization datasets: Gigaword, CNN/DailyMail, and arXiv.
arXiv Detail & Related papers (2024-06-25T09:55:22Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, a flow-guided recurrent latent propagation module enhances stability across longer videos.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
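The local mechanism can be illustrated with a temporal layer in which each spatial location attends only along the frame axis; the block below is a generic example, not the paper's exact layer:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Toy temporal layer: every spatial location attends over the frame
    axis, keeping short windows of frames consistent."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        out, _ = self.attn(seq, seq, seq)      # attention along time only
        return out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

x = torch.randn(1, 8, 64, 16, 16)              # an 8-frame feature clip
print(TemporalAttention(64)(x).shape)          # torch.Size([1, 8, 64, 16, 16])
```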
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
- Diffusion Models Without Attention [110.5623058129782]
Diffusion State Space Model (DiffuSSM) is an architecture that supplants attention mechanisms with a more scalable state space model backbone.
Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward.
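To illustrate the substitution, here is a toy attention-free block in which the token mixer is a gated linear scan, so cost grows linearly rather than quadratically in token count; it sketches the idea, not DiffuSSM's actual layer:

```python
import torch
import torch.nn as nn

class SSMBlock(nn.Module):
    """Toy diffusion-backbone block whose token mixer is a linear-time
    scan instead of self-attention."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.decay = nn.Parameter(torch.zeros(dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, L, dim) latent tokens
        a = torch.sigmoid(self.decay)
        u = self.proj(self.norm(x))
        h = torch.zeros(x.size(0), x.size(-1), device=x.device)
        out = []
        for t in range(x.size(1)):             # O(L) scan replaces O(L^2) attention
            h = a * h + (1 - a) * u[:, t]
            out.append(h)
        return x + torch.stack(out, dim=1)     # residual, as in a transformer block

tokens = torch.randn(2, 1024, 256)             # long token sequences stay cheap
print(SSMBlock(256)(tokens).shape)             # torch.Size([2, 1024, 256])
```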
arXiv Detail & Related papers (2023-11-30T05:15:35Z)
- Decoupling Long- and Short-Term Patterns in Spatiotemporal Inference [31.245426664456257]
Deploying sensors at scale is impractical due to high costs, so obtaining fine-grained measurements has long been a pressing issue.
We propose a spatiotemporal graph attention network to learn the relations across space and time for short-term patterns.
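For intuition, the spatial half of such a model can be reduced to one attention step in which an unobserved location aggregates readings from observed sensors; everything below is a made-up illustration, not the paper's network:

```python
import torch
import torch.nn as nn

dim = 32
q_proj, k_proj, v_proj = (nn.Linear(dim, dim) for _ in range(3))
observed = torch.randn(10, dim)        # embeddings of 10 observed sensors
target = torch.randn(1, dim)           # embedding of the unobserved location
w = torch.softmax(q_proj(target) @ k_proj(observed).T / dim ** 0.5, dim=-1)
estimate = w @ v_proj(observed)        # inferred fine-grained reading, (1, dim)
```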
arXiv Detail & Related papers (2021-09-16T03:06:31Z)
- Unsupervised Video Decomposition using Spatio-temporal Iterative Inference [31.97227651679233]
Multi-object scene decomposition is a fast-emerging problem in representation learning.
We show that our model has a high accuracy even without color information.
We demonstrate the decomposition and segmentation prediction capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets.
arXiv Detail & Related papers (2020-06-25T22:57:17Z)