Facing Off World Model Backbones: RNNs, Transformers, and S4
- URL: http://arxiv.org/abs/2307.02064v2
- Date: Thu, 9 Nov 2023 16:50:43 GMT
- Title: Facing Off World Model Backbones: RNNs, Transformers, and S4
- Authors: Fei Deng, Junyeong Park, Sungjin Ahn
- Abstract summary: World models are a fundamental component in model-based reinforcement learning (MBRL).
We propose S4WM, the first world model compatible with parallelizable SSMs including S4 and its variants.
Our findings demonstrate that S4WM outperforms Transformer-based world models in terms of long-term memory, while exhibiting greater efficiency during training and imagination.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: World models are a fundamental component in model-based reinforcement
learning (MBRL). To perform temporally extended and consistent simulations of
the future in partially observable environments, world models need to possess
long-term memory. However, state-of-the-art MBRL agents, such as Dreamer,
predominantly employ recurrent neural networks (RNNs) as their world model
backbone, which have limited memory capacity. In this paper, we seek to explore
alternative world model backbones for improving long-term memory. In
particular, we investigate the effectiveness of Transformers and Structured
State Space Sequence (S4) models, motivated by their remarkable ability to
capture long-range dependencies in low-dimensional sequences and their
complementary strengths. We propose S4WM, the first world model compatible with
parallelizable SSMs including S4 and its variants. By incorporating latent
variable modeling, S4WM can efficiently generate high-dimensional image
sequences through latent imagination. Furthermore, we extensively compare RNN-,
Transformer-, and S4-based world models across four sets of environments, which
we have tailored to assess crucial memory capabilities of world models,
including long-term imagination, context-dependent recall, reward prediction,
and memory-based reasoning. Our findings demonstrate that S4WM outperforms
Transformer-based world models in terms of long-term memory, while exhibiting
greater efficiency during training and imagination. These results pave the way
for the development of stronger MBRL agents.
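As background for the abstract above, the core of S4-style models is a linear state space recurrence that can equivalently be computed as a convolution, which is what makes training parallelizable. The sketch below illustrates this equivalence with toy diagonal matrices; all names and dimensions are illustrative and do not reflect the actual S4WM parameterization.

```python
import numpy as np

# Toy linear state space model (SSM): x_t = A x_{t-1} + B u_t, y_t = C x_t.
# A stable diagonal A keeps the recurrence well-behaved over long sequences.
rng = np.random.default_rng(0)
d_state, seq_len = 4, 16

A = np.diag(rng.uniform(0.5, 0.95, d_state))  # stable diagonal transition
B = rng.standard_normal((d_state, 1)) * 0.1
C = rng.standard_normal((1, d_state))

u = rng.standard_normal(seq_len)  # 1-D input sequence

# Sequential (recurrent) view: one step per time index.
x = np.zeros((d_state, 1))
y_rec = np.empty(seq_len)
for t in range(seq_len):
    x = A @ x + B * u[t]
    y_rec[t] = (C @ x).item()

# Convolutional view: precompute the kernel K_k = C A^k B once, then
# apply it to the whole sequence at once (parallelizable over time).
K = np.array([(C @ np.linalg.matrix_power(A, k) @ B).item()
              for k in range(seq_len)])
y_conv = np.convolve(u, K)[:seq_len]

print(np.allclose(y_rec, y_conv))  # True
```

The two views produce identical outputs: the recurrence unrolls to y_t = sum_k C A^k B u_{t-k}, which is exactly the convolution with kernel K.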
Related papers
- Automatically Learning Hybrid Digital Twins of Dynamical Systems [56.69628749813084]
Digital Twins (DTs) simulate the states and temporal dynamics of real-world systems.
DTs often struggle to generalize to unseen conditions in data-scarce settings.
In this paper, we propose an evolutionary algorithm (HDTwinGen) to autonomously propose, evaluate, and optimize HDTwins.
arXiv Detail & Related papers (2024-10-31T07:28:22Z) - FACTS: A Factored State-Space Framework For World Modelling [24.08175276756845]
We propose a novel recurrent framework, the FACTored State-space (FACTS) model, for spatial-temporal world modelling.
The FACTS framework constructs a graph-memory with a routing mechanism that learns permutable memory representations.
It consistently outperforms or matches specialised state-of-the-art models, despite its general-purpose world modelling design.
arXiv Detail & Related papers (2024-10-28T11:04:42Z) - Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient [9.519619751861333]
We propose a world model built on the Mamba state space model (SSM).
It achieves $O(n)$ memory and computational complexity while effectively capturing long-term dependencies.
This model is accessible and can be trained on an off-the-shelf laptop.
arXiv Detail & Related papers (2024-10-11T15:10:40Z) - PoseMamba: Monocular 3D Human Pose Estimation with Bidirectional Global-Local Spatio-Temporal State Space Model [7.286873011001679]
We propose a purely SSM-based approach with linear complexity for 3D human pose estimation in monocular video.
Specifically, we propose a bidirectional global-local spatio-temporal block that comprehensively models human joint relations within individual frames as well as across frames.
This reordering provides a more logical geometric ordering, resulting in a combined global-local spatial scan.
arXiv Detail & Related papers (2024-08-07T04:38:03Z) - Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models [106.94827590977337]
We propose a novel world model for Multi-Agent RL (MARL) that learns decentralized local dynamics for scalability.
We also introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation.
Results on the StarCraft Multi-Agent Challenge (SMAC) show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.
arXiv Detail & Related papers (2024-06-22T12:40:03Z) - Mastering Memory Tasks with World Models [12.99255437732525]
Current model-based reinforcement learning (MBRL) agents struggle with long-term dependencies.
We present a new method, Recall to Imagine (R2I), to improve temporal coherence.
R2I establishes a new state-of-the-art for challenging memory and credit assignment RL tasks.
arXiv Detail & Related papers (2024-03-07T06:35:59Z) - Convolutional State Space Models for Long-Range Spatiotemporal Modeling [65.0993000439043]
ConvS5 is an efficient variant for long-range spatiotemporal modeling.
It significantly outperforms Transformers and ConvLSTM on a long-horizon Moving-MNIST experiment while training 3X faster than ConvLSTM and generating samples 400X faster than Transformers.
arXiv Detail & Related papers (2023-10-30T16:11:06Z) - Deep Latent State Space Models for Time-Series Generation [68.45746489575032]
We propose LS4, a generative model for sequences with latent variables evolving according to a state space ODE.
Inspired by recent deep state space models (S4), we achieve speedups by leveraging a convolutional representation of LS4.
We show that LS4 significantly outperforms previous continuous-time generative models in terms of marginal distribution, classification, and prediction scores on real-world datasets.
arXiv Detail & Related papers (2022-12-24T15:17:42Z) - TransDreamer: Reinforcement Learning with Transformer World Models [33.34909288732319]
We propose a transformer-based Model-Based Reinforcement Learning agent, called TransDreamer.
We first introduce the Transformer State-Space Model, a world model that leverages a transformer for dynamics predictions. We then share this world model with a transformer-based policy network and obtain stability in training a transformer-based RL agent.
In experiments, we apply the proposed model to 2D visual RL and 3D first-person visual RL tasks both requiring long-range memory access for memory-based reasoning. We show that the proposed model outperforms Dreamer in these complex tasks.
arXiv Detail & Related papers (2022-02-19T00:30:52Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.