Related papers: UniZero: Generalized and Efficient Planning with Scalable Latent World Models

UniZero: Generalized and Efficient Planning with Scalable Latent World Models

URL: http://arxiv.org/abs/2406.10667v1
Date: Sat, 15 Jun 2024 15:24:15 GMT
Title: UniZero: Generalized and Efficient Planning with Scalable Latent World Models
Authors: Yuan Pu, Yazhe Niu, Jiyuan Ren, Zhenjie Yang, Hongsheng Li, Yu Liu,
Abstract summary: We present textitUniZero, a novel approach that textitdisentangles latent states from implicit latent history using a transformer-based latent world model. We demonstrate that UniZero, even with single-frame inputs, matches or surpasses the performance of MuZero-style algorithms on the Atari 100k benchmark.
Score: 29.648382211926364
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning predictive world models is essential for enhancing the planning capabilities of reinforcement learning agents. Notably, the MuZero-style algorithms, based on the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, in environments that require capturing long-term dependencies, MuZero's performance deteriorates rapidly. We identify that this is partially due to the \textit{entanglement} of latent representations with historical information, which results in incompatibility with the auxiliary self-supervised state regularization. To overcome this limitation, we present \textit{UniZero}, a novel approach that \textit{disentangles} latent states from implicit latent history using a transformer-based latent world model. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in latent space. We demonstrate that UniZero, even with single-frame inputs, matches or surpasses the performance of MuZero-style algorithms on the Atari 100k benchmark. Furthermore, it significantly outperforms prior baselines in benchmarks that require long-term memory. Lastly, we validate the effectiveness and scalability of our design choices through extensive ablation studies, visual analyses, and multi-task learning results. The code is available at \textcolor{magenta}{https://github.com/opendilab/LightZero}.

Related papers

R-Zero: Self-Evolving Reasoning LLM from Zero Data [56.74402018426378]
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences.<n>Existing methods for training such models still rely heavily on vast human-curated tasks and labels.<n>We introduce R-Zero, a fully autonomous framework that generates its own training data from scratch.
arXiv Detail & Related papers (2025-08-07T03:38:16Z)
World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks [53.98633183204453]
In this paper, a novel world model-based learning framework is proposed to minimize packet-completeness-aware age of information (CAoI) in a vehicular network.<n>A world model framework is proposed to jointly learn a dynamic model of the mmWave V2X environment and use it to imagine trajectories for learning how to perform link scheduling.<n>In particular, the long-term policy is learned in differentiable imagined trajectories instead of environment interactions.
arXiv Detail & Related papers (2025-05-03T06:23:18Z)
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo [90.78001821963008]
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. We develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC) Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language.
arXiv Detail & Related papers (2025-04-17T17:49:40Z)
Mamba-CL: Optimizing Selective State Space Model in Null Space for Continual Learning [54.19222454702032]
Continual Learning aims to equip AI models with the ability to learn a sequence of tasks over time, without forgetting previously learned knowledge. State Space Models (SSMs) have achieved notable success in computer vision. We introduce Mamba-CL, a framework that continuously fine-tunes the core SSMs of the large-scale Mamba foundation model.
arXiv Detail & Related papers (2024-11-23T06:36:16Z)
UmambaTSF: A U-shaped Multi-Scale Long-Term Time Series Forecasting Method Using Mamba [7.594115034632109]
We propose UmambaTSF, a novel long-term time series forecasting framework. It integrates multi-scale feature extraction capabilities of U-shaped encoder-decoder multilayer perceptrons (MLP) with Mamba's long sequence representation. UmambaTSF achieves state-of-the-art performance and excellent generality on widely used benchmark datasets.
arXiv Detail & Related papers (2024-10-15T04:56:43Z)
Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient [9.519619751861333]
We propose a state space model (SSM) based world model based on Mamba. It achieves $O(n)$ memory and computational complexity while effectively capturing long-term dependencies. This model is accessible and can be trained on an off-the-shelf laptop.
arXiv Detail & Related papers (2024-10-11T15:10:40Z)
Scaling Offline Model-Based RL via Jointly-Optimized World-Action Model Pretraining [49.730897226510095]
We introduce JOWA: Jointly-Reinforced World-Action model, an offline model-based RL agent pretrained on Atari games with 6 billion tokens data. Our largest agent, with 150 million parameters, 78.9% human-level performance on pretrained games using only 10% subsampled offline data, outperforming existing state-of-the-art large-scale offline RL baselines by 31.6% on averange.
arXiv Detail & Related papers (2024-10-01T10:25:03Z)
Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need [28.301119776877822]
Time series forecasting requires balancing short-term and long-term dependencies for accurate predictions. Transformers are superior in modeling long-term dependencies but are criticized for their quadratic computational cost. Mamba provides a near-linear alternative but is reported less effective in time series longterm forecasting due to potential information loss.
arXiv Detail & Related papers (2024-08-28T17:59:27Z)
MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit intrinsic-temporal contextual modeling for robust visible-thermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations. Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z)
DyG-Mamba: Continuous State Space Modeling on Dynamic Graphs [59.434893231950205]
Dynamic graph learning aims to uncover evolutionary laws in real-world systems. We propose DyG-Mamba, a new continuous state space model for dynamic graph learning. We show that DyG-Mamba achieves state-of-the-art performance on most datasets.
arXiv Detail & Related papers (2024-08-13T15:21:46Z)
ZeroG: Investigating Cross-dataset Zero-shot Transferability in Graphs [36.749959232724514]
ZeroG is a new framework tailored to enable cross-dataset generalization. We address the inherent challenges such as feature misalignment, mismatched label spaces, and negative transfer. We propose a prompt-based subgraph sampling module that enriches the semantic information and structure information of extracted subgraphs.
arXiv Detail & Related papers (2024-02-17T09:52:43Z)
A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm. Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources. We also present an advanced algorithm that significantly simplifies the EM computational demands.
arXiv Detail & Related papers (2024-02-02T21:48:50Z)
LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios [32.83545787965431]
Building agents based on tree-search planning capabilities with learned models has achieved remarkable success in classic decision-making problems, such as Go and Atari. It has been deemed challenging or even infeasible to extend Monte Carlo Tree Search (MCTS) based algorithms to diverse real-world applications. In this work, we introduce LightZero, the first unified benchmark for deploying MCTS/MuZero in general sequential decision scenarios.
arXiv Detail & Related papers (2023-10-12T14:18:09Z)
OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning [67.07363529640784]
We propose OpenSTL to categorize prevalent approaches into recurrent-based and recurrent-free models. We conduct standard evaluations on datasets across various domains, including synthetic moving object trajectory, human motion, driving scenes, traffic flow and forecasting weather. We find that recurrent-free models achieve a good balance between efficiency and performance than recurrent models.
arXiv Detail & Related papers (2023-06-20T03:02:14Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX. textttMEX integrates estimation and planning components while balancing exploration exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
Equivariant MuZero [14.027651496499882]
We propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group.
arXiv Detail & Related papers (2023-02-09T17:46:29Z)
DITTO: Offline Imitation Learning with World Models [21.419536711242962]
DITTO is an offline imitation learning algorithm which addresses all three of these problems. We optimize this multi-step latent divergence using standard reinforcement learning algorithms. Our results show how creative use of world models can lead to a simple, robust, and highly-performant policy-learning framework.
arXiv Detail & Related papers (2023-02-06T19:41:18Z)
Efficient Offline Policy Optimization with a Learned Model [83.64779942889916]
MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages Reanalyze algorithm to learn purely from offline data. This paper investigates a few hypotheses where MuZero Unplugged may not work well under the offline settings.
arXiv Detail & Related papers (2022-10-12T07:41:04Z)
Transformers are Sample Efficient World Models [1.9444242128493845]
We introduce IRIS, a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer. With the equivalent of only two hours of gameplay in the Atari 100k benchmark, IRIS achieves a mean human normalized score of 1.046, and outperforms humans on 10 out of 26 games.
arXiv Detail & Related papers (2022-09-01T17:03:07Z)
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage. We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
Meta-Learned Attribute Self-Gating for Continual Generalized Zero-Shot Learning [82.07273754143547]
We propose a meta-continual zero-shot learning (MCZSL) approach to generalizing a model to categories unseen during training. By pairing self-gating of attributes and scaled class normalization with meta-learning based training, we are able to outperform state-of-the-art results.
arXiv Detail & Related papers (2021-02-23T18:36:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.