Model Selection for Offline Reinforcement Learning: Practical
Considerations for Healthcare Settings
- URL: http://arxiv.org/abs/2107.11003v1
- Date: Fri, 23 Jul 2021 02:41:51 GMT
- Title: Model Selection for Offline Reinforcement Learning: Practical
Considerations for Healthcare Settings
- Authors: Shengpu Tang, Jenna Wiens
- Abstract summary: Reinforcement learning can be used to learn treatment policies and aid decision making in healthcare.
A standard validation pipeline for model selection requires running a learned policy in the actual environment, which is often infeasible in healthcare.
Our work serves as a practical guide for offline RL model selection and can help RL practitioners select policies using real-world datasets.
- Score: 13.376364233897528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) can be used to learn treatment policies and aid
decision making in healthcare. However, given the need for generalization over
complex state/action spaces, the incorporation of function approximators (e.g.,
deep neural networks) requires model selection to reduce overfitting and
improve policy performance at deployment. Yet a standard validation pipeline
for model selection requires running a learned policy in the actual
environment, which is often infeasible in a healthcare setting. In this work,
we investigate a model selection pipeline for offline RL that relies on
off-policy evaluation (OPE) as a proxy for validation performance. We present
an in-depth analysis of popular OPE methods, highlighting the additional
hyperparameters and computational requirements (fitting/inference of auxiliary
models) when used to rank a set of candidate policies. We compare the utility
of different OPE methods as part of the model selection pipeline in the context
of learning to treat patients with sepsis. Among all the OPE methods we
considered, fitted Q evaluation (FQE) consistently leads to the best validation
ranking, but at a high computational cost. To balance this trade-off between
accuracy of ranking and computational efficiency, we propose a simple two-stage
approach to accelerate model selection by avoiding potentially unnecessary
computation. Our work serves as a practical guide for offline RL model
selection and can help RL practitioners select policies using real-world
datasets. To facilitate reproducibility and future extensions, the code
accompanying this paper is available online at
https://github.com/MLD3/OfflineRL_ModelSelection.
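To make the pipeline concrete, the sketch below (not the authors' implementation; their code is at the repository above) shows OPE-based model selection in a small tabular setting: each candidate policy is scored with a simple fitted Q evaluation (FQE) on the logged transitions, and candidates are ranked by their estimated value at the initial states. The dataset format, deterministic candidate policies, and hyperparameters are illustrative assumptions.

```python
# A minimal, illustrative sketch of OPE-based model selection (tabular FQE),
# not the paper's implementation. Dataset format, deterministic candidate
# policies, and hyperparameters below are assumptions for the example.
import numpy as np

def fitted_q_evaluation(dataset, policy, n_states, n_actions,
                        gamma=0.99, n_iters=200):
    """Tabular fitted Q evaluation (FQE) of a fixed deterministic policy.

    dataset: iterable of (s, a, r, s_next, done) transitions logged by some
    behavior policy; policy: integer array mapping state -> action.
    """
    q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        targets = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for s, a, r, s_next, done in dataset:
            # Bootstrapped target uses the *evaluated* policy's action at s_next.
            backup = r if done else r + gamma * q[s_next, policy[s_next]]
            targets[s, a] += backup
            counts[s, a] += 1
        seen = counts > 0
        q[seen] = targets[seen] / counts[seen]  # tabular "regression" = mean target
    return q

def rank_policies_by_ope(dataset, candidate_policies, initial_states,
                         n_states, n_actions, gamma=0.99):
    """Score each candidate by its FQE-estimated value at the initial states
    and return candidate indices ordered best-first, with their scores."""
    scores = []
    for policy in candidate_policies:
        q = fitted_q_evaluation(dataset, policy, n_states, n_actions, gamma)
        scores.append(np.mean([q[s, policy[s]] for s in initial_states]))
    order = np.argsort(scores)[::-1]  # higher estimated value = better rank
    return order, scores

if __name__ == "__main__":
    # Toy example: 3 states, 2 actions, a handful of logged transitions.
    dataset = [(0, 0, 0.0, 1, False), (1, 1, 1.0, 2, True),
               (0, 1, 0.0, 2, True), (1, 0, 0.0, 0, False)]
    candidates = [np.array([0, 1, 0]), np.array([1, 0, 0])]
    order, scores = rank_policies_by_ope(dataset, candidates, initial_states=[0],
                                         n_states=3, n_actions=2)
    print("ranking (best first):", order, "scores:", scores)
```

The same loop applies with function approximators (e.g., neural FQE), at a much higher computational cost per candidate. One natural way to realize the two-stage idea from the abstract is to screen candidates with a cheaper OPE estimator before running FQE only on a top-ranked subset; the exact procedure used in the paper may differ.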
Related papers
- Policy Trees for Prediction: Interpretable and Adaptive Model Selection for Machine Learning [5.877778007271621]
We introduce a tree-based approach, Optimal Predictive-Policy Trees (OP2T), that yields interpretable policies for adaptively selecting a predictive model or ensemble.
Our approach enables interpretable and adaptive model selection and rejection while only assuming access to model outputs.
We evaluate our approach on real-world datasets, including regression and classification tasks with both structured and unstructured data.
arXiv Detail & Related papers (2024-05-30T21:21:33Z)
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
- Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF).
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Deep Offline Reinforcement Learning for Real-world Treatment Optimization Applications [3.770564448216192]
We introduce a practical and theoretically grounded transition sampling approach to address action imbalance during offline RL training.
We perform extensive experiments on two real-world tasks for diabetes and sepsis treatment optimization.
Across a range of principled and clinically relevant metrics, we show that our proposed approach enables substantial improvements in expected health outcomes.
arXiv Detail & Related papers (2023-02-15T09:30:57Z)
- Oracle Inequalities for Model Selection in Offline Reinforcement Learning [105.74139523696284]
We study the problem of model selection in offline RL with value function approximation.
We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors.
We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
arXiv Detail & Related papers (2022-11-03T17:32:34Z)
- Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data [28.846826115837825]
Offline reinforcement learning can be used to improve future performance by leveraging historical data.
We introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy.
We show it can have substantial impacts when the dataset is small.
arXiv Detail & Related papers (2022-10-16T21:24:53Z)
- Testing Stationarity and Change Point Detection in Reinforcement Learning [10.343546104340962]
We develop a consistent procedure to test the nonstationarity of the optimal Q-function based on pre-collected historical data.
We further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimization in nonstationary environments.
arXiv Detail & Related papers (2022-03-03T13:30:28Z)
- An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
- Pessimistic Model Selection for Offline Deep Reinforcement Learning [56.282483586473816]
Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision making problems in many applications.
One main barrier is the over-fitting issue that leads to poor generalizability of the policy learned by DRL.
We propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee (a generic lower-confidence-bound sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-11-29T06:29:49Z)
- B-Pref: Benchmarking Preference-Based Reinforcement Learning [84.41494283081326]
We introduce B-Pref, a benchmark specially designed for preference-based RL.
A key challenge with such a benchmark is providing the ability to evaluate candidate algorithms quickly.
B-Pref alleviates this by simulating teachers with a wide array of irrationalities.
arXiv Detail & Related papers (2021-11-04T17:32:06Z)
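The pessimistic model selection entry above suggests a complementary selection rule: prefer the candidate whose value estimate remains high even under uncertainty. The sketch below illustrates that general idea with a bootstrap lower confidence bound on a weighted importance sampling (WIS) estimate; the estimator, the bound, and the data format are illustrative assumptions, not the PMS algorithm from the cited paper.

```python
# Illustrative pessimistic selection: score each candidate policy by a lower
# confidence bound (bootstrap percentile) on a WIS value estimate. This is a
# sketch of the general idea, not the cited paper's PMS algorithm.
import numpy as np

def wis_estimate(trajectories, eval_policy, behavior_policy, gamma=0.99):
    """Weighted importance sampling estimate of eval_policy's value.

    trajectories: list of trajectories, each a list of (s, a, r) tuples;
    eval_policy / behavior_policy: arrays of shape (n_states, n_actions) of
    action probabilities (behavior must be positive on every logged action).
    """
    weights, returns = [], []
    for traj in trajectories:
        w, g, discount = 1.0, 0.0, 1.0
        for s, a, r in traj:
            w *= eval_policy[s, a] / behavior_policy[s, a]
            g += discount * r
            discount *= gamma
        weights.append(w)
        returns.append(g)
    weights, returns = np.array(weights), np.array(returns)
    if weights.sum() == 0.0:
        return 0.0
    return float(np.dot(weights, returns) / weights.sum())

def pessimistic_score(trajectories, eval_policy, behavior_policy,
                      n_boot=200, alpha=0.05, seed=0):
    """Bootstrap over trajectories and return the alpha-quantile (a lower
    confidence bound) of the resulting WIS estimates."""
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        resample = [trajectories[i] for i in idx]
        estimates.append(wis_estimate(resample, eval_policy, behavior_policy))
    return float(np.quantile(estimates, alpha))

# Pessimistic selection: keep the candidate with the highest lower bound, e.g.
#   best = max(candidates, key=lambda pi: pessimistic_score(data, pi, behavior_probs))
```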
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content it hosts and is not responsible for any consequences arising from its use.