Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
- URL: http://arxiv.org/abs/2107.01715v1
- Date: Sun, 4 Jul 2021 19:32:24 GMT
- Title: Improve Agents without Retraining: Parallel Tree Search with Off-Policy Correction
- Authors: Assaf Hallak, Gal Dalal, Steven Dalton, Iuri Frosio, Shie Mannor, Gal Chechik
- Abstract summary: We tackle two major challenges with Tree Search (TS): distribution shift and scalability.
We first discover and analyze a counter-intuitive phenomenon: action selection through TS and a pre-trained value function often leads to lower performance compared to the original pre-trained agent.
We introduce Batch-BFS: a GPU breadth-first search that advances all nodes in each depth of the tree simultaneously.
- Score: 63.595545216327245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tree Search (TS) is crucial to some of the most influential successes in
reinforcement learning. Here, we tackle two major challenges with TS that limit
its usability: *distribution shift* and *scalability*. We first
discover and analyze a counter-intuitive phenomenon: action selection through
TS and a pre-trained value function often leads to lower performance compared
to the original pre-trained agent, even when having access to the exact state
and reward in future steps. We show this is due to a distribution shift to
areas where value estimates are highly inaccurate and analyze this effect using
Extreme Value theory. To overcome this problem, we introduce a novel off-policy
correction term that accounts for the mismatch between the pre-trained value
and its corresponding TS policy by penalizing under-sampled trajectories. We
prove that our correction eliminates the above mismatch and bound the
probability of sub-optimal action selection. Our correction significantly
improves pre-trained Rainbow agents without any further training, often more
than doubling their scores on Atari games. Next, we address the scalability
issue posed by the computational complexity of exhaustive TS, which scales
exponentially with the tree depth. We introduce Batch-BFS: a GPU breadth-first
search that advances all nodes in each depth of the tree simultaneously.
Batch-BFS reduces runtime by two orders of magnitude and, beyond inference,
also enables training with TS at depths that were not feasible before. We train
DQN agents from scratch using TS and show improvement in several Atari games
compared to both the original DQN and the more advanced Rainbow.
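The two mechanisms described in the abstract, batched breadth-first expansion and a penalty on under-sampled trajectories, can be illustrated with a short sketch. The Python/PyTorch code below is only an illustration under stated assumptions, not the paper's implementation: `model`, `value_net`, `under_sampling_penalty`, and the weight `lam` are hypothetical stand-ins for a batched simulator, the pre-trained value function, and the off-policy correction term.

```python
# A minimal sketch of batched breadth-first tree search with a penalized leaf
# score. Assumes a batched simulator model(states, actions) -> (next_states,
# rewards) and a pre-trained value network value_net(states). All names and
# the penalty term are illustrative stand-ins, not the paper's actual code.
import torch


def batch_bfs_action(root_state, model, value_net, under_sampling_penalty,
                     n_actions, depth, gamma=0.99, lam=0.1):
    states = root_state.unsqueeze(0)   # frontier at the current depth
    returns = torch.zeros(1)           # discounted return accumulated per path
    first_action = None                # root action that started each path

    for d in range(depth):
        b = states.shape[0]
        # Expand every frontier node by every action in a single batched call;
        # the frontier grows from b to b * n_actions nodes.
        rep_states = states.repeat_interleave(n_actions, dim=0)
        rep_actions = torch.arange(n_actions).repeat(b)
        next_states, rewards = model(rep_states, rep_actions)
        returns = returns.repeat_interleave(n_actions) + (gamma ** d) * rewards
        first_action = (rep_actions.clone() if first_action is None
                        else first_action.repeat_interleave(n_actions))
        states = next_states

    # Score each leaf by its return-to-go plus a bootstrapped value estimate,
    # minus a penalty for trajectories the pre-trained policy would rarely
    # visit (a stand-in for the off-policy correction term).
    leaf_scores = (returns
                   + (gamma ** depth) * value_net(states).squeeze(-1)
                   - lam * under_sampling_penalty(states))
    return int(first_action[torch.argmax(leaf_scores)])
```

The point of the loop is that every level of the tree costs one batched model call (plus one batched value call at the leaves) instead of `n_actions ** depth` sequential node expansions, which is what makes deep exhaustive search practical on a GPU.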
Related papers
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
ReST-MCTS* integrates process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces.
We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines.
We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations.
arXiv Detail & Related papers (2024-06-06T07:40:00Z)
- Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters.
We employ a simple unit-ball normalization that enables learning under large update ratios.
arXiv Detail & Related papers (2024-03-09T19:56:40Z)
- Efficient local linearity regularization to overcome catastrophic overfitting [59.463867084204566]
Catastrophic overfitting (CO) in single-step adversarial training results in abrupt drops in the adversarial test accuracy (even down to 0%).
We introduce a regularization term, called ELLE, to mitigate CO effectively and efficiently in classical AT evaluations.
arXiv Detail & Related papers (2024-01-21T22:55:26Z)
- Test-Time Adaptation via Conjugate Pseudo-labels [21.005027151753477]
Test-time adaptation (TTA) refers to adapting neural networks to distribution shifts.
Prior TTA methods optimize over unsupervised objectives such as the entropy of model predictions in TENT.
We present a surprising phenomenon: if we attempt to meta-learn the best possible TTA loss over a wide class of functions, then we recover a function that is remarkably similar to (a temperature-scaled version of) the softmax-entropy employed by TENT; a minimal sketch of such a loss appears after this list.
arXiv Detail & Related papers (2022-07-20T04:02:19Z)
- Retrieval-Augmented Reinforcement Learning [63.32076191982944]
We train a network to map a dataset of past experiences to optimal behavior.
The retrieval process is trained to retrieve information from the dataset that may be useful in the current context.
We show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores.
arXiv Detail & Related papers (2022-02-17T02:44:05Z)
- Hindsight Experience Replay with Kronecker Product Approximate Curvature [5.441932327359051]
Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks.
However, due to its reduced sample efficiency and slower convergence, HER fails to perform effectively.
Natural gradients address these challenges by improving the convergence of the model parameters.
The proposed method addresses these challenges with better sample efficiency, faster convergence, and an increased success rate.
arXiv Detail & Related papers (2020-10-09T20:25:14Z)
- Munchausen Reinforcement Learning [50.396037940989146]
Bootstrapping is a core mechanism in Reinforcement Learning (RL).
We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods on Atari games.
We provide strong theoretical insights on what happens under the hood -- implicit Kullback-Leibler regularization and increase of the action-gap.
arXiv Detail & Related papers (2020-07-28T18:30:23Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to the optimal training distribution and uses it to re-weight the transitions used for training.
arXiv Detail & Related papers (2020-03-16T16:18:52Z)
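As noted in the Test-Time Adaptation entry above, the meta-learned loss is reported to resemble a temperature-scaled softmax entropy. The snippet below is a minimal, hypothetical PyTorch sketch of such an objective; the function name and temperature value are illustrative and not taken from either paper.

```python
# Temperature-scaled softmax entropy: the kind of unsupervised objective TENT
# minimizes at test time (with temperature 1), and which the meta-learned TTA
# loss is reported to resemble. Purely illustrative; not code from either paper.
import torch
import torch.nn.functional as F


def softmax_entropy(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Mean entropy of softmax(logits / temperature) over the batch."""
    log_p = F.log_softmax(logits / temperature, dim=-1)
    return -(log_p.exp() * log_p).sum(dim=-1).mean()
```

In a test-time-adaptation loop this quantity would be minimized on each incoming test batch over a small set of adaptation parameters (in TENT, the normalization layers' affine parameters).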
This list is automatically generated from the titles and abstracts of the papers on this site.