Related papers: How Hard is it to Confuse a World Model?

How Hard is it to Confuse a World Model?

URL: http://arxiv.org/abs/2510.21232v1
Date: Fri, 24 Oct 2025 08:08:12 GMT
Title: How Hard is it to Confuse a World Model?
Authors: Waris Radji, Odalric-Ambrym Maillard,
Abstract summary: In reinforcement learning theory, the concept of most confusing instances is central to establishing regret lower bounds.<n>We formalize this problem for neural network world models as a constrained optimization.<n>Our results suggest that the degree of achievable confusion correlates with uncertainty in the approximate model.
Score: 6.896797484250302
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In reinforcement learning (RL) theory, the concept of most confusing instances is central to establishing regret lower bounds, that is, the minimal exploration needed to solve a problem. Given a reference model and its optimal policy, a most confusing instance is the statistically closest alternative model that makes a suboptimal policy optimal. While this concept is well-studied in multi-armed bandits and ergodic tabular Markov decision processes, constructing such instances remains an open question in the general case. In this paper, we formalize this problem for neural network world models as a constrained optimization: finding a modified model that is statistically close to the reference one, while producing divergent performance between optimal and suboptimal policies. We propose an adversarial training procedure to solve this problem and conduct an empirical study across world models of varying quality. Our results suggest that the degree of achievable confusion correlates with uncertainty in the approximate model, which may inform theoretically-grounded exploration strategies for deep model-based RL.

Related papers

Model Merging via Multi-Teacher Knowledge Distillation [11.543771846135021]
We introduce a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting.<n>We frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data.<n>We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk.
arXiv Detail & Related papers (2025-12-24T17:10:44Z)
Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [53.03951222945921]
We analyze smoothed (perturbed) policies, adding controlled random perturbations to the direction used by the linear oracle.<n>Our main contribution is a generalization bound that decomposes the excess risk into perturbation bias, statistical estimation error, and optimization error.<n>We illustrate the scope of the results on applications such as vehicle scheduling, highlighting how smoothing enables both tractable training and controlled generalization.
arXiv Detail & Related papers (2024-07-24T12:00:30Z)
Model-Free Active Exploration in Reinforcement Learning [53.786439742572995]
We study the problem of exploration in Reinforcement Learning and present a novel model-free solution. Our strategy is able to identify efficient policies faster than state-of-the-art exploration approaches.
arXiv Detail & Related papers (2024-06-30T19:00:49Z)
Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.<n>We propose a novel algorithm that yields an varepsilon-optimal randomized hypothesis with a sample complexity on the order of (d+k)/varepsilon2.
arXiv Detail & Related papers (2023-12-08T16:06:29Z)
Pretty darn good control: when are approximate solutions better than approximate models [0.0]
We show that DRL algorithms can successfully approximate solutions in a non-linear three-variable model for a fishery. We show that the policy obtained with DRL is both more profitable and more sustainable than any constant mortality policy.
arXiv Detail & Related papers (2023-08-25T19:58:17Z)
Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation. We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal inequalities up to logarithmic factors. We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z)
When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL) Our follow-up derived bounds reveal the relationship between model shifts and performance improvement. A further example demonstrates that learning models from a dynamically-varying number of explorations benefit the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning [21.931580762349096]
We introduce an algorithm that computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem.
arXiv Detail & Related papers (2022-06-04T23:36:38Z)
Sample Complexity of Robust Reinforcement Learning with a Generative Model [0.0]
We propose a model-based reinforcement learning (RL) algorithm for learning an $epsilon$-optimal robust policy. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies.
arXiv Detail & Related papers (2021-12-02T18:55:51Z)
A bandit-learning approach to multifidelity approximation [7.960229223744695]
Multifidelity approximation is an important technique in scientific computation and simulation. We introduce a bandit-learning approach for leveraging data of varying fidelities to achieve precise estimates.
arXiv Detail & Related papers (2021-03-29T05:29:35Z)
COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions. We find that COMBO consistently performs as well or better as compared to prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.