Related papers: Diffusion for World Modeling: Visual Details Matter in Atari

Diffusion for World Modeling: Visual Details Matter in Atari

URL: http://arxiv.org/abs/2405.12399v1
Date: Mon, 20 May 2024 22:51:05 GMT
Title: Diffusion for World Modeling: Visual Details Matter in Atari
Authors: Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, François Fleuret,
Abstract summary: We introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model. We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance.
Score: 22.915802013352465
License: http://creativecommons.org/licenses/by/4.0/
Abstract: World models constitute a promising approach for training reinforcement learning agents in a safe and sample-efficient manner. Recent world models predominantly operate on sequences of discrete latent variables to model environment dynamics. However, this compression into a compact discrete representation may ignore visual details that are important for reinforcement learning. Concurrently, diffusion models have become a dominant approach for image generation, challenging well-established methods modeling discrete latents. Motivated by this paradigm shift, we introduce DIAMOND (DIffusion As a Model Of eNvironment Dreams), a reinforcement learning agent trained in a diffusion world model. We analyze the key design choices that are required to make diffusion suitable for world modeling, and demonstrate how improved visual details can lead to improved agent performance. DIAMOND achieves a mean human normalized score of 1.46 on the competitive Atari 100k benchmark; a new best for agents trained entirely within a world model. To foster future research on diffusion for world modeling, we release our code, agents and playable world models at https://github.com/eloialonso/diamond.

Related papers

Bootstrapping World Models from Dynamics Models in Multimodal Foundation Models [37.774994737939394]
We use dynamics models to bootstrap world models using synthetic data and inference time verification.<n>Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of $15%$ on real-world subsets according to GPT4o-as-judge.
arXiv Detail & Related papers (2025-06-06T11:50:18Z)
Vid2World: Crafting Video Diffusion Models to Interactive World Models [38.270098691244314]
Vid2World is a general approach for leveraging and transferring pre-trained video diffusion models into interactive world models.<n>It performs casualization of a pre-trained video diffusion model by crafting its architecture and training objective to enable autoregressive generation.<n>It introduces a causal action guidance mechanism to enhance action controllability in the resulting interactive world model.
arXiv Detail & Related papers (2025-05-20T13:41:45Z)
Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning [93.58897637077001]
This paper tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. We pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model.
arXiv Detail & Related papers (2025-03-11T13:50:22Z)
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding [15.717333276867462]
Unified Self-supervised Pretraining (USP) is a framework that initializes diffusion models via masked latent modeling in a Variational Autoencoder (VAE) latent space. USP achieves comparable performance in understanding tasks while significantly improving the convergence speed and generation quality of diffusion models.
arXiv Detail & Related papers (2025-03-08T09:01:03Z)
The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation [53.837937703425794]
LanDiff is a hybrid framework that synergizes the strengths of autoregressive language models and diffusion models. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D representations through efficient semantic compression; (2) a language model that generates semantic tokens with high-level semantic relationships; and (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos.
arXiv Detail & Related papers (2025-03-06T16:53:14Z)
EDELINE: Enhancing Memory in Diffusion-based World Models via Linear-Time Sequence Modeling [8.250616459360684]
We introduce EDELINE, a unified world model architecture that integrates state space models with diffusion models.<n>Our approach outperforms existing baselines across visually challenging Atari 100k tasks, memory-demanding benchmark, and 3D first-person ViZDoom environments.
arXiv Detail & Related papers (2025-02-01T15:49:59Z)
DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On [103.89972383310715]
DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion model.
arXiv Detail & Related papers (2024-12-19T02:24:35Z)
Scaling Diffusion Language Models via Adaptation from Autoregressive Models [105.70889434492143]
Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling. We show that we can convert AR models ranging from 127M to 7B parameters into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts.
arXiv Detail & Related papers (2024-10-23T14:04:22Z)
AVID: Adapting Video Diffusion Models to World Models [10.757223474031248]
We propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation.
arXiv Detail & Related papers (2024-10-01T13:48:31Z)
FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets. We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation. Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation [59.184980778643464]
Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI) In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion) Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment.
arXiv Detail & Related papers (2024-02-15T18:59:18Z)
Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining. We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z)
SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning. The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views. We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion [36.321494200830244]
Copilot4D is a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion. Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
arXiv Detail & Related papers (2023-11-02T06:21:56Z)
HarmonyDream: Task Harmonization Inside World Models [93.07314830304193]
Model-based reinforcement learning (MBRL) holds the promise of sample-efficient learning. We propose a simple yet effective approach, HarmonyDream, which automatically adjusts loss coefficients to maintain task harmonization.
arXiv Detail & Related papers (2023-09-30T11:38:13Z)
Masked Diffusion Models Are Fast Distribution Learners [32.485235866596064]
Diffusion models are commonly trained to learn all fine-grained visual information from scratch. We show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution. Then the pre-trained model can be fine-tuned for various generation tasks efficiently.
arXiv Detail & Related papers (2023-06-20T08:02:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.