Decision Making in Non-Stationary Environments with Policy-Augmented
Search
- URL: http://arxiv.org/abs/2401.03197v2
- Date: Sat, 20 Jan 2024 18:34:03 GMT
- Title: Decision Making in Non-Stationary Environments with Policy-Augmented
Search
- Authors: Ava Pettet, Yunuo Zhang, Baiting Luo, Kyle Wray, Hendrik Baier, Aron
Laszka, Abhishek Dubey, Ayan Mukhopadhyay
- Abstract summary: We introduce \textit{Policy-Augmented Monte Carlo tree search} (PA-MCTS).
It combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment.
We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action and also bound the error accrued while following PA-MCTS as a policy.
- Score: 9.000981144624507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential decision-making under uncertainty is present in many important
problems. Two popular approaches for tackling such problems are reinforcement
learning and online search (e.g., Monte Carlo tree search). While the former
learns a policy by interacting with the environment (typically done before
execution), the latter uses a generative model of the environment to sample
promising action trajectories at decision time. Decision-making is particularly
challenging in non-stationary environments, where the environment in which an
agent operates can change over time. Both approaches have shortcomings in such
settings -- on the one hand, policies learned before execution become stale
when the environment changes and relearning takes both time and computational
effort. Online search, on the other hand, can return sub-optimal actions when
there are limitations on allowed runtime. In this paper, we introduce
\textit{Policy-Augmented Monte Carlo tree search} (PA-MCTS), which combines
action-value estimates from an out-of-date policy with an online search using
an up-to-date model of the environment. We prove theoretical results showing
conditions under which PA-MCTS selects the one-step optimal action and also
bound the error accrued while following PA-MCTS as a policy. We compare and
contrast our approach with AlphaZero, another hybrid planning approach, and
Deep Q Learning on several OpenAI Gym environments. Through extensive
experiments, we show that under non-stationary settings with limited time
constraints, PA-MCTS outperforms these baselines.
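As an illustration of the core idea, the sketch below blends a stale policy's action-value estimates with estimates produced by an online search using the current model. The convex weight alpha and all names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of policy-augmented action selection: combine Q-values from an
# out-of-date policy with Q-values estimated by an up-to-date online search.
# `alpha` and all names here are illustrative, not the paper's exact API.
from typing import Callable, Dict, Hashable, List

Action = Hashable

def policy_augmented_action(
    state,
    policy_q: Callable[[object, Action], float],  # Q(s, a) from the stale, pre-trained policy
    search_q: Callable[[object, Action], float],  # Q(s, a) from online search with the current model
    actions: List[Action],
    alpha: float = 0.5,                           # weight on the stale policy's estimate
) -> Action:
    """Return the action maximizing a convex combination of both estimates."""
    scores: Dict[Action, float] = {
        a: alpha * policy_q(state, a) + (1.0 - alpha) * search_q(state, a)
        for a in actions
    }
    return max(scores, key=scores.get)
```

Intuitively, a smaller alpha discounts the pre-trained policy when the environment has drifted far from the one it was trained in, while a larger alpha lets the policy compensate for a search budget too small to be reliable on its own.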
Related papers
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
arXiv Detail & Related papers (2024-05-29T13:36:36Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, a stochastic (hyper)policy is learned only so that its deterministic version can be deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
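As a hedged illustration of that practice (not the paper's algorithm), a simple Gaussian policy can carry an exploration level sigma that matters only during learning, while deployment executes the deterministic mean; all names below are hypothetical.

```python
# Illustrative sketch only: a linear Gaussian policy whose exploration level
# sigma is used while learning, while deployment uses the deterministic mean.
import numpy as np

class GaussianPolicy:
    def __init__(self, state_dim: int, action_dim: int, sigma: float):
        self.W = np.zeros((action_dim, state_dim))  # mean parameters (learned)
        self.sigma = sigma                          # exploration level, tuned for learning

    def explore(self, state: np.ndarray) -> np.ndarray:
        """Stochastic action used while collecting data and estimating gradients."""
        mean = self.W @ state
        return mean + self.sigma * np.random.randn(*mean.shape)

    def deploy(self, state: np.ndarray) -> np.ndarray:
        """Deterministic version of the policy, executed after training."""
        return self.W @ state
```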
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov Decision Processes [5.276882857467777]
We present a search algorithm called \textit{Adaptive Monte Carlo Tree Search} (ADA-MCTS).
We show that the agent can learn the updated dynamics of the environment over time and then act as it learns, i.e., if the agent is in a region of the state space about which it has updated knowledge, it can avoid being pessimistic.
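A minimal, generic sketch of "acting as you learn" (not ADA-MCTS itself) is to keep count-based transition estimates refreshed from every observed transition, so a planner can rely on them only in regions of the state space with updated knowledge.

```python
# Generic sketch, not ADA-MCTS: count-based transition estimates updated online.
from collections import defaultdict

class OnlineDynamicsModel:
    def __init__(self):
        # (state, action) -> {next_state: visit count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, s_next) -> None:
        """Record a transition observed in the (possibly changed) environment."""
        self.counts[(s, a)][s_next] += 1

    def transition_probs(self, s, a) -> dict:
        """Empirical next-state distribution; empty if this region is still unknown."""
        successors = self.counts[(s, a)]
        total = sum(successors.values())
        return {s_next: c / total for s_next, c in successors.items()} if total else {}
```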
arXiv Detail & Related papers (2024-01-03T17:19:54Z)
- AI planning in the imagination: High-level planning on learned abstract search spaces [68.75684174531962]
We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training.
We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman.
arXiv Detail & Related papers (2023-08-16T22:47:16Z)
- A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees [5.250288418639076]
We propose an online POMDP solver called Lazy Cross-Entropy Search Over Policy Trees (LCEOPT).
At each planning step, our method uses a novel lazy Cross-Entropy method to search the space of policy trees.
Our method is surprisingly simple as compared to existing state-of-the-art methods, yet empirically outperforms them on several continuous-action POMDP problems.
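For readers unfamiliar with the search style, the snippet below is a textbook cross-entropy method loop over a real-valued parameter vector; it only illustrates the general technique and is not LCEOPT's policy-tree parameterization or its lazy evaluation scheme.

```python
# Textbook cross-entropy method (CEM): repeatedly sample candidates, keep the
# elites, and refit the sampling distribution. Not the paper's algorithm.
import numpy as np

def cem_maximize(score_fn, dim, iters=20, pop=64, elite_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))    # sample candidate parameters
        scores = np.array([score_fn(x) for x in samples])   # evaluate each candidate
        elites = samples[np.argsort(scores)[-n_elite:]]     # keep the highest-scoring ones
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit the distribution
    return mean

# Toy usage: recover the maximizer of a simple concave score.
best = cem_maximize(lambda x: -np.sum((x - 3.0) ** 2), dim=2)
```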
arXiv Detail & Related papers (2023-05-14T03:12:53Z)
- Learning Logic Specifications for Soft Policy Guidance in POMCP [71.69251176275638]
Partially Observable Monte Carlo Planning (POMCP) is an efficient solver for Partially Observable Markov Decision Processes (POMDPs).
POMCP suffers from sparse reward functions, i.e., rewards are obtained only when the final goal is reached.
In this paper, we use inductive logic programming to learn logic specifications from traces of POMCP executions.
arXiv Detail & Related papers (2023-03-16T09:37:10Z)
- Dichotomy of Control: Separating What You Can Control from What You Cannot [129.62135987416164]
We propose Dichotomy of Control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity).
We show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior.
arXiv Detail & Related papers (2022-10-24T17:49:56Z)
- A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z)
- Decision Making in Non-Stationary Environments with Policy-Augmented Monte Carlo Tree Search [2.20439695290991]
Decision-making under uncertainty (DMU) is present in many important problems.
An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time.
We present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses.
arXiv Detail & Related papers (2022-02-25T22:31:37Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z)