Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
- URL: http://arxiv.org/abs/2107.01715v1
- Date: Sun, 4 Jul 2021 19:32:24 GMT
- Title: Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
- Authors: Assaf Hallak, Gal Dalal, Steven Dalton, Iuri Frosio, Shie Mannor, Gal Chechik
- Abstract summary: We tackle two major challenges with Tree Search (TS).
We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent.
We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously.
- Score: 63.595545216327245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tree Search (TS) is crucial to some of the most influential successes in
reinforcement learning. Here, we tackle two major challenges with TS that limit
its usability: \textit{distribution shift} and \textit{scalability}. We first
discover and analyze a counter-intuitive phenomenon: action selection through
TS and a pre-trained value function often leads to lower performance compared
to the original pre-trained agent, even with access to the exact states and
rewards of future steps. We show this is due to a distribution shift to
areas where value estimates are highly inaccurate and analyze this effect using
Extreme Value theory. To overcome this problem, we introduce a novel off-policy
correction term that accounts for the mismatch between the pre-trained value
and its corresponding TS policy by penalizing under-sampled trajectories. We
prove that our correction eliminates the above mismatch and bound the
probability of sub-optimal action selection. Our correction significantly
improves pre-trained Rainbow agents without any further training, often more
than doubling their scores on Atari games. Next, we address the scalability
issue caused by the computational complexity of exhaustive TS, which scales
exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first
search that advances all nodes at each depth of the tree simultaneously.
Batch-BFS reduces runtime by two orders of magnitude and, beyond inference,
also enables training with TS at depths that were not feasible before. We train
DQN agents from scratch using TS and show improvements on several Atari games
compared to both the original DQN and the more advanced Rainbow.
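The two ingredients described above, exhaustive lookahead scored by a pre-trained value function and depth-wise batched expansion, can be illustrated with a short sketch. This is not the authors' implementation: the batched transition model `step_batch`, the network `value_net`, and the optional `penalty` hook (the place where an off-policy correction such as the paper's under-sampling penalty would enter) are all assumed interfaces.

```python
# Minimal sketch of a Batch-BFS-style lookahead (illustrative, not the paper's code).
# All leaves at the current depth are expanded with all actions in one batched call,
# and the resulting leaves are scored by a pre-trained value network in one GPU batch.
import numpy as np
import torch


def batch_bfs_action(root_state, step_batch, value_net, n_actions,
                     depth=2, gamma=0.99, penalty=None):
    """Select a root action via exhaustive depth-step lookahead.

    step_batch(states, actions) -> (next_states, rewards): assumed batched simulator.
    value_net(states_tensor) -> values: assumed pre-trained state-value estimator.
    penalty(leaf_values) -> corrected values: optional off-policy-correction hook
        (identity if None); the paper's exact correction term is not reproduced here.
    """
    states = root_state[None]                 # current frontier, starts at the root
    returns = np.zeros(1)                     # discounted rewards accumulated per leaf
    root_of = np.zeros(1, dtype=np.int64)     # maps each leaf to its depth-0 action

    for d in range(depth):
        k = len(states)
        # Expand every frontier node with every action in a single batched call.
        rep_states = np.repeat(states, n_actions, axis=0)
        rep_actions = np.tile(np.arange(n_actions), k)
        next_states, rewards = step_batch(rep_states, rep_actions)

        returns = np.repeat(returns, n_actions) + (gamma ** d) * rewards
        root_of = rep_actions if d == 0 else np.repeat(root_of, n_actions)
        states = next_states

    # Bootstrap all leaves with the pre-trained value function in one batch.
    with torch.no_grad():
        leaf_values = value_net(
            torch.as_tensor(states, dtype=torch.float32)).cpu().numpy().reshape(-1)
    if penalty is not None:
        leaf_values = penalty(leaf_values)
    totals = returns + (gamma ** depth) * leaf_values

    # The best leaf reachable under each depth-0 action decides the root action.
    best_per_root = np.full(n_actions, -np.inf)
    np.maximum.at(best_per_root, root_of, totals)
    return int(np.argmax(best_per_root))
```

Note that the frontier grows as |A|^depth, which is exactly the exponential blow-up that motivates advancing all nodes of a depth in one GPU batch rather than node by node.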
Related papers
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
We develop a reinforced self-training approach, called ReST-MCTS*, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models.
We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines such as Best-of-N and Tree-of-Thought, within the same search budget.
arXiv Detail & Related papers (2024-06-06T07:40:00Z)
- Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters.
We employ a simple unit-ball normalization that enables learning under large update ratios.
arXiv Detail & Related papers (2024-03-09T19:56:40Z)
- Efficient local linearity regularization to overcome catastrophic overfitting [59.463867084204566]
Catastrophic overfitting (CO) in single-step adversarial training results in abrupt drops in the adversarial test accuracy (even down to 0%).
We introduce a regularization term, called ELLE, to mitigate CO effectively and efficiently in classical AT evaluations.
arXiv Detail & Related papers (2024-01-21T22:55:26Z)
- Test-Time Adaptation via Conjugate Pseudo-labels [21.005027151753477]
Test-time adaptation (TTA) refers to adapting neural networks to distribution shifts.
Prior TTA methods optimize over unsupervised objectives such as the entropy of model predictions in TENT.
We present a surprising phenomenon: if we attempt to meta-learn the best possible TTA loss over a wide class of functions, then we recover a function that is remarkably similar to (a temperature-scaled version of) the softmax-entropy employed by TENT.
arXiv Detail & Related papers (2022-07-20T04:02:19Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
- Hindsight Experience Replay with Kronecker Product Approximate Curvature [5.441932327359051]
Hindsight Experience Replay (HER) is an efficient algorithm for solving Reinforcement Learning tasks.
However, due to reduced sample efficiency and slower convergence, HER can fail to perform effectively.
Natural gradients address these challenges by improving the convergence of the model parameters.
The proposed method tackles these challenges with better sample efficiency, faster convergence, and an increased success rate.
arXiv Detail & Related papers (2020-10-09T20:25:14Z)
- Munchausen Reinforcement Learning [50.396037940989146]
Bootstrapping is a core mechanism in Reinforcement Learning (RL).
We show that slightly modifying Deep Q-Network (DQN), by adding the scaled log-policy of the taken action to the immediate reward, provides an agent that is competitive with distributional methods on Atari games; a sketch of the resulting target appears after this list.
We provide strong theoretical insights on what happens under the hood -- implicit Kullback-Leibler regularization and increase of the action-gap.
arXiv Detail & Related papers (2020-07-28T18:30:23Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
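For the Munchausen Reinforcement Learning entry above, the following minimal sketch shows how the modified DQN target can be computed. It is an illustration rather than the authors' code: the batched target-network Q-value tensors are assumed inputs, and the hyperparameter defaults (alpha, tau, and the log-policy clip) are commonly reported values, not prescriptions.

```python
# Sketch of a Munchausen-DQN style target (illustrative, not the authors' code).
import torch
import torch.nn.functional as F


def munchausen_target(q_tgt_s, q_tgt_next, actions, rewards, dones,
                      gamma=0.99, tau=0.03, alpha=0.9, clip_min=-1.0):
    """q_tgt_s, q_tgt_next: target-net Q-values, shape (B, A);
    actions (int64), rewards, dones: shape (B,)."""
    # Softmax policy and log-policy induced by the target network at temperature tau.
    log_pi_s = F.log_softmax(q_tgt_s / tau, dim=-1)
    log_pi_next = F.log_softmax(q_tgt_next / tau, dim=-1)
    pi_next = log_pi_next.exp()

    # Munchausen bonus: scaled, clipped log-probability of the action actually taken,
    # added to the immediate reward.
    bonus = alpha * torch.clamp(
        tau * log_pi_s.gather(1, actions.unsqueeze(1)).squeeze(1),
        min=clip_min, max=0.0)

    # Soft (entropy-regularized) bootstrap over next-state actions.
    soft_next = (pi_next * (q_tgt_next - tau * log_pi_next)).sum(dim=-1)

    return rewards + bonus + gamma * (1.0 - dones) * soft_next
```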