SimuDICE: Offline Policy Optimization Through World Model Updates and DICE Estimation

URL: http://arxiv.org/abs/2412.06486v1
Date: Mon, 09 Dec 2024 13:35:46 GMT
Title: SimuDICE: Offline Policy Optimization Through World Model Updates and DICE Estimation
Authors: Catalin E. Brita, Stephan Bongers, Frans A. Oliehoek
Abstract summary: In offline reinforcement learning, deriving an effective policy from a pre-collected set of experiences is challenging. We introduce SimuDICE, a framework that iteratively refines the initial policy derived from offline data using synthetically generated experiences. SimuDICE achieves performance comparable to existing algorithms while requiring fewer pre-collected experiences and planning steps.
Score: 11.030633145295385
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In offline reinforcement learning, deriving an effective policy from a pre-collected set of experiences is challenging due to the distribution mismatch between the target policy and the behavioral policy used to collect the data, as well as the limited sample size. Model-based reinforcement learning improves sample efficiency by generating simulated experiences using a learned dynamic model of the environment. However, these synthetic experiences often suffer from the same distribution mismatch. To address these challenges, we introduce SimuDICE, a framework that iteratively refines the initial policy derived from offline data using synthetically generated experiences from the world model. SimuDICE enhances the quality of these simulated experiences by adjusting the sampling probabilities of state-action pairs based on stationary DIstribution Correction Estimation (DICE) and the estimated confidence in the model's predictions. This approach guides policy improvement by balancing experiences similar to those frequently encountered with ones that have a distribution mismatch. Our experiments show that SimuDICE achieves performance comparable to existing algorithms while requiring fewer pre-collected experiences and planning steps, and it remains robust across varying data collection policies.
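The reweighting idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's formula: the function name `sampling_probabilities`, the `mix` parameter, and the multiplicative use of a confidence score are all assumptions made for illustration. The sketch assumes a small discrete set of state-action pairs, a DICE-style ratio w(s, a) that approximates the target-to-behavior density ratio, and a model confidence score in [0, 1] for each pair.

```python
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale non-negative values so they sum to 1."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


def sampling_probabilities(
    data_dist: Sequence[float],
    dice_ratio: Sequence[float],
    confidence: Sequence[float],
    mix: float = 0.5,
) -> List[float]:
    """Hypothetical sampling weights over state-action pairs.

    data_dist:  empirical frequency of each pair in the offline data
    dice_ratio: assumed DICE-style correction w(s, a) ~ d_target / d_data
    confidence: assumed model confidence per pair, in [0, 1]
    mix:        0 keeps the data distribution, 1 uses the corrected one
    """
    behavior = normalize(data_dist)
    # Corrected distribution: reweight the data distribution by the ratio.
    target = normalize([d * w for d, w in zip(behavior, dice_ratio)])
    # Blend frequently seen pairs with pairs the target policy favors.
    blended = [(1 - mix) * b + mix * t for b, t in zip(behavior, target)]
    # Down-weight pairs where the world model is less trusted.
    return normalize([p * c for p, c in zip(blended, confidence)])


if __name__ == "__main__":
    probs = sampling_probabilities(
        data_dist=[0.5, 0.3, 0.2],
        dice_ratio=[0.5, 1.0, 2.25],
        confidence=[1.0, 1.0, 0.5],
    )
    print(probs)
```

In this sketch, `mix` plays the balancing role the abstract describes: lower values favor experiences similar to those frequently encountered in the data, and higher values favor pairs with a larger distribution mismatch.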

Related papers

Conditional Data Synthesis Augmentation [4.3108820946281945]
Conditional Data Synthesis Augmentation (CoDSA) is a novel framework that synthesizes high-fidelity data for improving model performance across multimodal domains. CoDSA fine-tunes pre-trained generative models to enhance the realism of synthetic data and increase sample density in sparse areas. We introduce a theoretical framework that quantifies the statistical accuracy improvements enabled by CoDSA as a function of synthetic sample volume and targeted region allocation.
arXiv Detail & Related papers (2025-04-10T03:38:11Z)
Model-Based Offline Reinforcement Learning with Adversarial Data Augmentation [36.9134885948595]
We introduce Model-based Offline Reinforcement learning with AdversariaL data augmentation (MORAL). In MORAL, we replace the fixed-horizon rollout by employing adversarial data augmentation to execute alternating sampling with ensemble models. Experiments on the D4RL benchmark demonstrate that MORAL outperforms other model-based offline RL methods in terms of policy learning and sample efficiency.
arXiv Detail & Related papers (2025-03-26T07:24:34Z)
Testing Generalizability in Causal Inference [3.547529079746247]
There is no formal procedure for statistically evaluating generalizability in machine learning algorithms. We propose a systematic and quantitative framework for evaluating model generalizability in causal inference settings. By basing simulations on real data, our method ensures more realistic evaluations, which are often missing in current work.
arXiv Detail & Related papers (2024-11-05T11:44:00Z)
On conditional diffusion models for PDE simulations [53.01911265639582]
We study score-based diffusion models for forecasting and assimilation of sparse observations. We propose an autoregressive sampling approach that significantly improves performance in forecasting. We also propose a new training strategy for conditional score-based models that achieves stable performance over a range of history lengths.
arXiv Detail & Related papers (2024-10-21T18:31:04Z)
MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models. We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
COSBO: Conservative Offline Simulation-Based Policy Optimization [7.696359453385686]
Offline reinforcement learning allows training reinforcement learning models on data from live deployments. In contrast, simulation environments attempting to replicate the live environment can be used instead of the live data. We propose a method that combines an imperfect simulation environment with data from the target environment to train an offline reinforcement learning policy.
arXiv Detail & Related papers (2024-09-22T12:20:55Z)
SAMBO-RL: Shifts-aware Model-based Offline Reinforcement Learning [9.88109749688605]
Model-based offline reinforcement learning trains policies using pre-collected datasets and learned environment models. This paper offers a comprehensive analysis that disentangles the problem into two fundamental components: model bias and policy shift. We introduce Shifts-aware Model-based Offline Reinforcement Learning (SAMBO-RL), a practical framework that efficiently trains classifiers to approximate SAR for policy optimization.
arXiv Detail & Related papers (2024-08-23T04:25:09Z)
Combining Experimental and Historical Data for Policy Evaluation [17.89146022336492]
We propose novel data integration methods that linearly integrate base policy value estimators constructed from the experimental and historical data. We derive their robustness and efficiency properties across a broad spectrum of reward shift scenarios. Numerical experiments and real-data-based analyses from a ridesharing company demonstrate the superior performance of the proposed estimators.
arXiv Detail & Related papers (2024-06-01T06:26:28Z)
Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation. We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes. A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
DEALIO: Data-Efficient Adversarial Learning for Imitation from Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator. Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms. This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk. We propose a more data-efficient IfO algorithm
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions. We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
Model-based Policy Optimization with Unsupervised Model Adaptation [37.09948645461043]
We investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization. We propose a novel model-based reinforcement learning framework AMPO, which introduces unsupervised model adaptation. Our approach achieves state-of-the-art performance in terms of sample efficiency on a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2020-10-19T14:19:42Z)