Decision Making in Non-Stationary Environments with Policy-Augmented
Search
- URL: http://arxiv.org/abs/2401.03197v2
- Date: Sat, 20 Jan 2024 18:34:03 GMT
- Title: Decision Making in Non-Stationary Environments with Policy-Augmented
Search
- Authors: Ava Pettet, Yunuo Zhang, Baiting Luo, Kyle Wray, Hendrik Baier, Aron
Laszka, Abhishek Dubey, Ayan Mukhopadhyay
- Abstract summary: We introduce \textit{Policy-Augmented Monte Carlo tree search} (PA-MCTS).
It combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment.
We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action and also bound the error accrued while following PA-MCTS as a policy.
- Score: 9.000981144624507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential decision-making under uncertainty is present in many important
problems. Two popular approaches for tackling such problems are reinforcement
learning and online search (e.g., Monte Carlo tree search). While the former
learns a policy by interacting with the environment (typically done before
execution), the latter uses a generative model of the environment to sample
promising action trajectories at decision time. Decision-making is particularly
challenging in non-stationary environments, where the environment in which an
agent operates can change over time. Both approaches have shortcomings in such
settings -- on the one hand, policies learned before execution become stale
when the environment changes and relearning takes both time and computational
effort. Online search, on the other hand, can return sub-optimal actions when
there are limitations on allowed runtime. In this paper, we introduce
\textit{Policy-Augmented Monte Carlo tree search} (PA-MCTS), which combines
action-value estimates from an out-of-date policy with an online search using
an up-to-date model of the environment. We prove theoretical results showing
conditions under which PA-MCTS selects the one-step optimal action and also
bound the error accrued while following PA-MCTS as a policy. We compare and
contrast our approach with AlphaZero, another hybrid planning approach, and
Deep Q Learning on several OpenAI Gym environments. Through extensive
experiments, we show that under non-stationary settings with limited time
constraints, PA-MCTS outperforms these baselines.
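As an illustration of the core idea, the sketch below blends a stale policy's action-value estimates with estimates produced by an online search using the current model. The convex weight alpha and all names are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of policy-augmented action selection: combine Q-values from an
# out-of-date policy with Q-values estimated by an up-to-date online search.
# `alpha` and all names here are illustrative, not the paper's exact API.
from typing import Callable, Dict, Hashable, List

Action = Hashable

def policy_augmented_action(
    state,
    policy_q: Callable[[object, Action], float],  # Q(s, a) from the stale, pre-trained policy
    search_q: Callable[[object, Action], float],  # Q(s, a) from online search with the current model
    actions: List[Action],
    alpha: float = 0.5,                           # weight on the stale policy's estimate
) -> Action:
    """Return the action maximizing a convex combination of both estimates."""
    scores: Dict[Action, float] = {
        a: alpha * policy_q(state, a) + (1.0 - alpha) * search_q(state, a)
        for a in actions
    }
    return max(scores, key=scores.get)
```

Intuitively, a smaller alpha discounts the pre-trained policy when the environment has drifted far from the one it was trained in, while a larger alpha lets the policy compensate for a search budget too small to be reliable on its own.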
Related papers
- OMPO: A Unified Framework for RL under Policy and Dynamics Shifts [42.57662196581823]
Training reinforcement learning policies using environment interaction data collected from varying policies or dynamics presents a fundamental challenge.
Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, or rely on specialized algorithms with task priors.
In this paper, we identify a unified strategy for online RL policy learning under diverse settings of policy and dynamics shifts: transition occupancy matching.
arXiv Detail & Related papers (2024-05-29T13:36:36Z)
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, a stochastic (hyper)policy is learned only so that its deterministic version can be deployed.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
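As a hedged illustration of that practice (not the paper's algorithm), a simple Gaussian policy can carry an exploration level sigma that matters only during learning, while deployment executes the deterministic mean; all names below are hypothetical.

```python
# Illustrative sketch only: a linear Gaussian policy whose exploration level
# sigma is used while learning, while deployment uses the deterministic mean.
import numpy as np

class GaussianPolicy:
    def __init__(self, state_dim: int, action_dim: int, sigma: float):
        self.W = np.zeros((action_dim, state_dim))  # mean parameters (learned)
        self.sigma = sigma                          # exploration level, tuned for learning

    def explore(self, state: np.ndarray) -> np.ndarray:
        """Stochastic action used while collecting data and estimating gradients."""
        mean = self.W @ state
        return mean + self.sigma * np.random.randn(*mean.shape)

    def deploy(self, state: np.ndarray) -> np.ndarray:
        """Deterministic version of the policy, executed after training."""
        return self.W @ state
```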
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov Decision Processes [5.276882857467777]
We present a search algorithm called \textit{Adaptive Monte Carlo Tree Search} (ADA-MCTS).
We show that the agent can learn the updated dynamics of the environment over time and then act as it learns, i.e., if the agent is in a region of the state space about which it has updated knowledge, it can avoid being pessimistic.
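A minimal, generic sketch of "acting as you learn" (not ADA-MCTS itself) is to keep count-based transition estimates refreshed from every observed transition, so a planner can rely on them only in regions of the state space with updated knowledge.

```python
# Generic sketch, not ADA-MCTS: count-based transition estimates updated online.
from collections import defaultdict

class OnlineDynamicsModel:
    def __init__(self):
        # (state, action) -> {next_state: visit count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, s_next) -> None:
        """Record a transition observed in the (possibly changed) environment."""
        self.counts[(s, a)][s_next] += 1

    def transition_probs(self, s, a) -> dict:
        """Empirical next-state distribution; empty if this region is still unknown."""
        successors = self.counts[(s, a)]
        total = sum(successors.values())
        return {s_next: c / total for s_next, c in successors.items()} if total else {}
```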
arXiv Detail & Related papers (2024-01-03T17:19:54Z)
- AI planning in the imagination: High-level planning on learned abstract search spaces [68.75684174531962]
We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training.
We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman.
arXiv Detail & Related papers (2023-08-16T22:47:16Z)
- A Surprisingly Simple Continuous-Action POMDP Solver: Lazy Cross-Entropy Search Over Policy Trees [5.250288418639076]
We propose an online POMDP solver called Lazy Cross-Entropy Search Over Policy Trees (LCEOPT).
At each planning step, our method uses a novel lazy Cross-Entropy method to search the space of policy trees.
Our method is surprisingly simple as compared to existing state-of-the-art methods, yet empirically outperforms them on several continuous-action POMDP problems.
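For readers unfamiliar with the search style, the snippet below is a textbook cross-entropy method loop over a real-valued parameter vector; it only illustrates the general technique and is not LCEOPT's policy-tree parameterization or its lazy evaluation scheme.

```python
# Textbook cross-entropy method (CEM): repeatedly sample candidates, keep the
# elites, and refit the sampling distribution. Not the paper's algorithm.
import numpy as np

def cem_maximize(score_fn, dim, iters=20, pop=64, elite_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = rng.normal(mean, std, size=(pop, dim))    # sample candidate parameters
        scores = np.array([score_fn(x) for x in samples])   # evaluate each candidate
        elites = samples[np.argsort(scores)[-n_elite:]]     # keep the highest-scoring ones
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6  # refit the distribution
    return mean

# Toy usage: recover the maximizer of a simple concave score.
best = cem_maximize(lambda x: -np.sum((x - 3.0) ** 2), dim=2)
```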
arXiv Detail & Related papers (2023-05-14T03:12:53Z)
- Learning Logic Specifications for Soft Policy Guidance in POMCP [71.69251176275638]
Partially Observable Monte Carlo Planning (POMCP) is an efficient solver for Partially Observable Markov Decision Processes (POMDPs).
POMCP suffers from sparse reward functions, i.e., rewards are obtained only when the final goal is reached.
In this paper, we use inductive logic programming to learn logic specifications from traces of POMCP executions.
arXiv Detail & Related papers (2023-03-16T09:37:10Z)
- Dichotomy of Control: Separating What You Can Control from What You Cannot [129.62135987416164]
We propose Dichotomy of Control (DoC), a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity).
We show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior.
arXiv Detail & Related papers (2022-10-24T17:49:56Z)
- A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z)
- Decision Making in Non-Stationary Environments with Policy-Augmented Monte Carlo Tree Search [2.20439695290991]
Decision-making under uncertainty (DMU) is present in many important problems.
An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time.
We present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses.
arXiv Detail & Related papers (2022-02-25T22:31:37Z)
- Deep Reinforcement Learning amidst Lifelong Non-Stationarity [67.24635298387624]
We show that an off-policy RL algorithm can reason about and tackle lifelong non-stationarity.
Our method leverages latent variable models to learn a representation of the environment from current and past experiences.
We also introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.
arXiv Detail & Related papers (2020-06-18T17:34:50Z)