Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions
- URL: http://arxiv.org/abs/2506.05419v1
- Date: Thu, 05 Jun 2025 00:39:03 GMT
- Title: Dream to Generalize: Zero-Shot Model-Based Reinforcement Learning for Unseen Visual Distractions
- Authors: Jeongsoo Ha, Kyungsoo Kim, Yusung Kim
- Abstract summary: We propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot model-based reinforcement learning (MBRL). Dr. G trains its encoder and world model with dual contrastive learning, which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure.
- Score: 14.137070712516005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model-based reinforcement learning (MBRL) has been used to efficiently solve vision-based control tasks in high-dimensional image observations. Although recent MBRL algorithms perform well on the observations they were trained on, they fail when faced with visual distractions in observations. These task-irrelevant distractions (e.g., clouds, shadows, and light) may be constantly present in real-world scenarios. In this study, we propose a novel self-supervised method, Dream to Generalize (Dr. G), for zero-shot MBRL. Dr. G trains its encoder and world model with dual contrastive learning, which efficiently captures task-relevant features among multi-view data augmentations. We also introduce a recurrent state inverse dynamics model that helps the world model to better understand the temporal structure. The proposed methods can enhance the robustness of the world model against visual distractions. To evaluate the generalization performance, we first train Dr. G on simple backgrounds and then test it on complex natural video backgrounds in the DeepMind Control suite and on randomized environments in Robosuite. Dr. G yields a performance improvement of 117% and 14% over prior works, respectively. Our code is open-sourced and available at https://github.com/JeongsooHa/DrG.git
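As a rough illustration of the two self-supervised objectives described in the abstract, the sketch below combines a dual contrastive (InfoNCE-style) loss between two augmented views with a recurrent state inverse dynamics loss. It is a minimal sketch assuming hypothetical `encoder`, `rssm`, and `inv_dyn` modules, not the authors' released implementation (see the linked repository for that).

```python
# Illustrative sketch (not the authors' exact code) of the two self-supervised
# objectives: `encoder`, `rssm`, and `inv_dyn` are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE between two batches of embeddings."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (N, N) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    # Positives sit on the diagonal; every other pair in the batch is a negative.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def dr_g_style_losses(encoder, rssm, inv_dyn, obs_view1, obs_view2, actions):
    """obs_view*: (B, T, C, H, W) two random augmentations of the same clip;
    actions: (B, T, A). Returns contrastive and inverse-dynamics losses."""
    B, T = actions.shape[:2]
    z1 = encoder(obs_view1.flatten(0, 1)).view(B, T, -1)
    z2 = encoder(obs_view2.flatten(0, 1)).view(B, T, -1)
    # Dual contrastive term: align task-relevant features of the two views.
    contrastive = info_nce(z1.flatten(0, 1), z2.flatten(0, 1))
    # Recurrent state inverse dynamics: predict a_t from (h_t, h_{t+1}) so the
    # world model's recurrent states capture the temporal structure.
    h = rssm(z1, actions)                          # (B, T, D) recurrent states
    pred_a = inv_dyn(torch.cat([h[:, :-1], h[:, 1:]], dim=-1))
    inverse_dynamics = F.mse_loss(pred_a, actions[:, :-1])
    return contrastive, inverse_dynamics
```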
Related papers
- Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models [2.005104318774207]
In this work, we investigate the use of pre-trained Vision-Language Models for zero-shot detection of AI-generated images. We show that task-aligned prompting elicits more focused reasoning, enhances latent capabilities in VLMs, and significantly improves performance without fine-tuning.
arXiv Detail & Related papers (2025-05-20T22:44:04Z)
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [29.524164786422368]
Recently, DeepSeek R1 has shown that reinforcement learning can substantially improve the reasoning capabilities of Large Language Models (LLMs). We investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs). We develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks.
arXiv Detail & Related papers (2025-04-10T10:05:15Z)
- Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion [36.321494200830244]
Copilot4D is a novel world modeling approach that first tokenizes sensor observations with VQVAE, then predicts the future via discrete diffusion.
Our results demonstrate that discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics.
arXiv Detail & Related papers (2023-11-02T06:21:56Z)
- HarmonyDream: Task Harmonization Inside World Models [93.07314830304193]
Model-based reinforcement learning (MBRL) holds the promise of sample-efficient learning.
We propose a simple yet effective approach, HarmonyDream, which automatically adjusts loss coefficients to maintain task harmonization.
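As a loose illustration of automatically adjusted loss coefficients (a generic uncertainty-style weighting, not HarmonyDream's published formulation), one common scheme learns a per-objective log-scale that rebalances the observation and reward modeling losses:

```python
import torch

# One common way to adjust loss coefficients automatically: learn a log-scale
# per objective and weight each loss by its inverse, plus a log-scale penalty
# so the scales cannot grow without bound. Names here are illustrative only.
log_sigma_obs = torch.zeros((), requires_grad=True)
log_sigma_rew = torch.zeros((), requires_grad=True)

def harmonized_loss(obs_loss, reward_loss):
    # Objectives with larger raw magnitude get smaller effective weights,
    # keeping observation and reward modeling on a comparable scale.
    return (obs_loss / log_sigma_obs.exp() + log_sigma_obs +
            reward_loss / log_sigma_rew.exp() + log_sigma_rew)
```

In practice the two log-scales would simply be appended to the world-model optimizer's parameter list so they are updated alongside the model.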
arXiv Detail & Related papers (2023-09-30T11:38:13Z)
- DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models [97.31200133440308]
We propose using online reinforcement learning to fine-tune text-to-image models.
We focus on diffusion models, defining the fine-tuning task as an RL problem.
Our approach, coined DPOK, integrates policy optimization with KL regularization.
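Schematically, and only as an assumption about how such an objective can look rather than DPOK's actual code, policy optimization with KL regularization over a sampled denoising trajectory reduces to a reward-weighted log-likelihood term plus a KL penalty toward the pretrained model:

```python
import torch

def kl_regularized_pg_loss(log_prob_new, log_prob_pretrained, reward, kl_weight=0.01):
    """Schematic objective for one sampled denoising trajectory: maximize the
    reward-weighted log-likelihood under the fine-tuned model while staying
    close to the pretrained text-to-image model via a KL penalty.
    All three tensor arguments are placeholders for quantities computed along
    the trajectory; they are assumptions for illustration."""
    policy_gradient = -(reward.detach() * log_prob_new).mean()
    # Monte-Carlo estimate of KL(new || pretrained) along the sampled trajectory.
    kl_penalty = (log_prob_new - log_prob_pretrained).mean()
    return policy_gradient + kl_weight * kl_penalty
```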
arXiv Detail & Related papers (2023-05-25T17:35:38Z)
- Predictive Experience Replay for Continual Visual Control and Forecasting [62.06183102362871]
We present a new continual learning approach for visual dynamics modeling and explore its efficacy in visual control and forecasting.
We first propose the mixture world model that learns task-specific dynamics priors with a mixture of Gaussians, and then introduce a new training strategy to overcome catastrophic forgetting.
Our model remarkably outperforms the naive combinations of existing continual learning and visual RL algorithms on DeepMind Control and Meta-World benchmarks with continual visual control tasks.
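A mixture-of-Gaussians dynamics prior of the kind mentioned above can be pictured as a small prediction head; the module below is an illustrative stand-in with arbitrary component count and layer sizes, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MixtureDynamicsPrior(nn.Module):
    """Predicts a K-component Gaussian mixture over the next latent state,
    so that different mixture components can specialize to different tasks."""
    def __init__(self, latent_dim=128, action_dim=6, k=4, hidden=256):
        super().__init__()
        self.k, self.latent_dim = k, latent_dim
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, k * (2 * latent_dim + 1)),
        )

    def forward(self, z, a):
        out = self.net(torch.cat([z, a], dim=-1))
        logits, stats = out[..., :self.k], out[..., self.k:]
        mean, log_std = stats.view(*z.shape[:-1], self.k,
                                   2 * self.latent_dim).chunk(2, dim=-1)
        mix = torch.distributions.Categorical(logits=logits)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mean, log_std.exp()), 1)
        return torch.distributions.MixtureSameFamily(mix, comp)
```

Training such a head would maximize the log-likelihood of observed transitions, e.g. `-prior(z, a).log_prob(z_next).mean()`.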
arXiv Detail & Related papers (2023-03-12T05:08:03Z)
- EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones [80.662250618795]
This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers).
As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models by >1.5x on ImageNet-1K/22K without sacrificing accuracy.
arXiv Detail & Related papers (2022-11-17T17:38:55Z)
- Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method to solve this benchmark: unsupervised model-based RL for pre-training the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations [18.770113681323906]
Top-performing Model-Based Reinforcement Learning (MBRL) agents, such as Dreamer, learn the world model by reconstructing the image observations.
We propose to learn the prototypes from the recurrent states of the world model, thereby distilling temporal structures from past observations and actions into prototypes.
The resulting model, DreamerPro, successfully combines Dreamer with prototypes, making large performance gains on the DeepMind Control suite.
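One way to picture learning prototypes from recurrent states is a SwAV-style assignment loss; the snippet below is a loose sketch under assumed shapes and names, not DreamerPro's released code:

```python
import torch
import torch.nn.functional as F

def prototype_loss(recurrent_states, prototypes, targets, temperature=0.1):
    """recurrent_states: (N, D) world-model recurrent states,
    prototypes: (K, D) learnable prototype vectors,
    targets: (N, K) soft cluster assignments (e.g. computed from another view).
    Cross-entropy pulls each state toward its assigned prototypes."""
    h = F.normalize(recurrent_states, dim=-1)
    c = F.normalize(prototypes, dim=-1)
    logits = h @ c.t() / temperature                # (N, K) similarities
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```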
arXiv Detail & Related papers (2021-10-27T16:35:00Z)
- Contrastive Variational Reinforcement Learning for Complex Observations [39.98639686743489]
This paper presents Contrastive Variational Reinforcement Learning (CVRL), a model-based method that tackles complex visual observations in DRL.
CVRL learns a contrastive variational model by maximizing the mutual information between latent states and observations discriminatively.
It achieves comparable performance with state-of-the-art model-based DRL methods on standard Mujoco tasks.
arXiv Detail & Related papers (2020-08-06T02:25:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.