Decision Making in Non-Stationary Environments with Policy-Augmented
Monte Carlo Tree Search
- URL: http://arxiv.org/abs/2202.13003v1
- Date: Fri, 25 Feb 2022 22:31:37 GMT
- Title: Decision Making in Non-Stationary Environments with Policy-Augmented
Monte Carlo Tree Search
- Authors: Geoffrey Pettet, Ayan Mukhopadhyay, Abhishek Dubey
- Abstract summary: Decision-making under uncertainty (DMU) is present in many important problems.
An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time.
We present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses.
- Score: 2.20439695290991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decision-making under uncertainty (DMU) is present in many important
problems. An open challenge is DMU in non-stationary environments, where the
dynamics of the environment can change over time. Reinforcement Learning (RL),
a popular approach for DMU problems, learns a policy by interacting with a
model of the environment offline. Unfortunately, if the environment changes, the
policy can become stale and take sub-optimal actions, and relearning the policy
for the updated environment takes time and computational effort. An alternative
is online planning approaches such as Monte Carlo Tree Search (MCTS), which
perform their computation at decision time. Given the current environment, MCTS
plans using high-fidelity models to determine promising action trajectories.
These models can be updated as soon as environmental changes are detected to
immediately incorporate them into decision making. However, MCTS's convergence
can be slow for domains with large state-action spaces. In this paper, we
present a novel hybrid decision-making approach that combines the strengths of
RL and planning while mitigating their weaknesses. Our approach, called Policy
Augmented MCTS (PA-MCTS), integrates a policy's action-value estimates into
MCTS, using the estimates to seed the action trajectories favored by the
search. We hypothesize that PA-MCTS will converge more quickly than standard
MCTS while making better decisions than the policy can make on its own when
faced with non-stationary environments. We test our hypothesis by comparing
PA-MCTS with pure MCTS and an RL agent applied to the classical CartPole
environment. We find that PA-MCTS can achieve higher cumulative rewards than
the policy in isolation under several environmental shifts while converging in
significantly fewer iterations than pure MCTS.
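The core idea lends itself to a short illustration. Below is a minimal Python sketch of how a (possibly stale) policy's action-value estimates might seed MCTS node statistics so that early simulations favor the policy's preferred actions; the helper policy_q and the exact seeding scheme are illustrative assumptions, not the paper's implementation.

```python
import math

# Minimal sketch of policy-seeded MCTS node statistics (an illustration,
# not the paper's implementation). policy_q(state, action) is an assumed
# interface to the stale policy's action-value estimates.

class PANode:
    def __init__(self, state, actions, policy_q):
        self.state = state
        self.N = {a: 0 for a in actions}  # visit counts
        # Seed value estimates with the stale policy's Q-values instead of
        # zeros, so the first simulations follow the policy's favored actions.
        self.Q = {a: policy_q(state, a) for a in actions}

    def select_action(self, c=1.4):
        # UCT selection over the seeded estimates.
        total = sum(self.N.values()) + 1
        return max(self.N, key=lambda a: self.Q[a]
                   + c * math.sqrt(math.log(total) / (self.N[a] + 1)))

    def update(self, action, simulated_return):
        # Running average: fresh returns from the up-to-date model gradually
        # override the stale seed as visit counts grow.
        self.N[action] += 1
        self.Q[action] += (simulated_return - self.Q[action]) / self.N[action]
```

Under this kind of seeding, actions the stale policy rates highly are explored first, while simulated returns from the up-to-date model can override the stale estimates as visits accumulate.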
Related papers
- Survival of the Fittest: Evolutionary Adaptation of Policies for Environmental Shifts [0.15889427269227555]
We develop an adaptive re-training algorithm inspired by evolutionary game theory (EGT)
ERPO shows faster policy adaptation, higher average rewards, and reduced computational costs in policy adaptation.
arXiv Detail & Related papers (2024-10-22T09:29:53Z) - Decision Making in Non-Stationary Environments with Policy-Augmented
Search [9.000981144624507]
We introduce Policy-Augmented Monte Carlo Tree Search (PA-MCTS).
It combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment.
We prove theoretical results showing conditions under which PA-MCTS selects the one-step optimal action and also bound the error accrued while following PA-MCTS as a policy (a hedged sketch of this combination appears after this list).
arXiv Detail & Related papers (2024-01-06T11:51:50Z) - Act as You Learn: Adaptive Decision-Making in Non-Stationary Markov
Decision Processes [5.276882857467777]
We present a search algorithm called Adaptive Monte Carlo Tree Search (ADA-MCTS).
We show that the agent can learn the updated dynamics of the environment over time and then act as it learns, i.e., if the agent is in a region of the state space about which it has updated knowledge, it can avoid being pessimistic.
arXiv Detail & Related papers (2024-01-03T17:19:54Z) - Robust Multi-Agent Reinforcement Learning via Adversarial
Regularization: Theoretical Foundation and Stable Algorithms [79.61176746380718]
Multi-Agent Reinforcement Learning (MARL) has shown promising results across several domains.
MARL policies often lack robustness and are sensitive to small changes in their environment.
We show that we can gain robustness by controlling a policy's Lipschitz constant.
We propose a new robust MARL framework, ERNIE, that promotes the Lipschitz continuity of the policies.
arXiv Detail & Related papers (2023-10-16T20:14:06Z) - Learning Logic Specifications for Soft Policy Guidance in POMCP [71.69251176275638]
Partially Observable Monte Carlo Planning (POMCP) is an efficient solver for Partially Observable Markov Decision Processes (POMDPs)
POMCP suffers from a sparse reward function: rewards are achieved only when the final goal is reached.
In this paper, we use inductive logic programming to learn logic specifications from traces of POMCP executions.
arXiv Detail & Related papers (2023-03-16T09:37:10Z) - Hallucinated Adversarial Control for Conservative Offline Policy
Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z) - Policy Dispersion in Non-Markovian Environment [53.05904889617441]
This paper aims to learn diverse policies from the history of state-action pairs in a non-Markovian environment.
We first adopt a transformer-based method to learn policy embeddings.
Then, we stack the policy embeddings to construct a dispersion matrix to induce a set of diverse policies.
arXiv Detail & Related papers (2023-02-28T11:58:39Z) - Dichotomy of Control: Separating What You Can Control from What You
Cannot [129.62135987416164]
We propose a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity).
We show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior.
arXiv Detail & Related papers (2022-10-24T17:49:56Z) - Modular Deep Reinforcement Learning for Continuous Motion Planning with
Temporal Logic [59.94347858883343]
This paper investigates the motion planning of autonomous dynamical systems modeled by Markov decision processes (MDPs).
The novelty is to design an embedded product MDP (EP-MDP) between the LDGBA and the MDP.
The proposed LDGBA-based reward shaping and discounting schemes for the model-free reinforcement learning (RL) only depend on the EP-MDP states.
arXiv Detail & Related papers (2021-02-24T01:11:25Z)
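For the "Policy-Augmented Search" entry above, the following is a hedged sketch of the described combination: the final action maximizes a convex combination of the out-of-date policy's action values and the values estimated by the online search. The weight alpha and the helpers stale_policy_q and mcts_q are illustrative assumptions rather than that paper's exact interface.

```python
# Hedged sketch of combining a stale policy's action values with an online
# search estimate, as described in the "Policy-Augmented Search" entry above.
# alpha, stale_policy_q, and mcts_q are illustrative assumptions.

def pa_mcts_action(state, actions, stale_policy_q, mcts_q, alpha=0.5):
    """Return the action maximizing a convex combination of the stale
    policy's estimate and the up-to-date search estimate."""
    def combined(a):
        return alpha * stale_policy_q(state, a) + (1.0 - alpha) * mcts_q(state, a)
    return max(actions, key=combined)
```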