Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL
- URL: http://arxiv.org/abs/2206.02039v2
- Date: Tue, 7 Jun 2022 20:41:50 GMT
- Title: Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL
- Authors: Kin-Ho Lam, Delyar Tabatabai, Jed Irvine, Donald Bertucci, Anita
Ruangrotsakun, Minsuk Kahng, Alan Fern
- Abstract summary: Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios.
We consider testing RL agents that make decisions via online tree search using a learned transition model and value function.
We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game.
- Score: 20.360392791376707
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Reinforcement learning (RL) agents are commonly evaluated via their expected
value over a distribution of test scenarios. Unfortunately, this evaluation
approach provides limited evidence for post-deployment generalization beyond
the test distribution. In this paper, we address this limitation by extending
the recent CheckList testing methodology from natural language processing to
planning-based RL. Specifically, we consider testing RL agents that make
decisions via online tree search using a learned transition model and value
function. The key idea is to improve the assessment of future performance via a
CheckList approach for exploring and assessing the agent's inferences during
tree search. The approach provides the user with an interface and general
query-rule mechanism for identifying potential inference flaws and validating
expected inference invariances. We present a user study involving knowledgeable
AI researchers using the approach to evaluate an agent trained to play a
complex real-time strategy game. The results show the approach is effective in
allowing users to identify previously-unknown flaws in the agent's reasoning.
In addition, our analysis provides insight into how AI experts use this type of
testing approach, which may help improve future instantiations.
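A minimal sketch of what such a query-rule might look like in code is given below. The `value` interface, the `mirror_state` transformation, and the tolerance are hypothetical stand-ins, not the paper's implementation; the sketch only captures the invariance-checking pattern.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Violation:
    state: Any
    original_value: float
    transformed_value: float


def invariance_rule(value: Callable[[Any], float],
                    transform: Callable[[Any], Any],
                    test_states: List[Any],
                    tolerance: float = 0.05) -> List[Violation]:
    """Flag states whose learned value estimate changes by more than
    `tolerance` under a transformation that should not matter
    (e.g. mirroring the map of a symmetric strategy game)."""
    violations = []
    for state in test_states:
        v0 = value(state)
        v1 = value(transform(state))
        if abs(v0 - v1) > tolerance:
            violations.append(Violation(state, v0, v1))
    return violations


# Hypothetical usage: check the agent's learned value function over the
# states it actually expanded during tree search.
# flaws = invariance_rule(agent.value, mirror_state, expanded_states)
```

In the paper, rules of this kind are posed through the interface over the agent's inferences during tree search; the code above is only an illustration of the invariance-rule idea.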
Related papers
- Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation [51.06031200728449]
We propose a novel framework called mccHRL to provide different levels of temporal abstraction on listwise recommendation.
Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy.
Results show a significant performance improvement for our method compared with several well-known baselines.
arXiv Detail & Related papers (2024-09-11T17:01:06Z)
- Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning [53.241569810013836]
We propose a novel framework that utilizes large language models (LLMs) to identify effective feature generation rules.
We use decision trees to convey this reasoning information, as they can be easily represented in natural language.
OCTree consistently enhances the performance of various prediction models across diverse benchmarks.
arXiv Detail & Related papers (2024-06-12T08:31:34Z) - Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods [49.62131719441252]
Attribution methods compute importance scores for input features to explain the output predictions of deep models.
In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill.
We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Unveiling the Sentinels: Assessing AI Performance in Cybersecurity Peer Review [4.081120388114928]
In the field of cybersecurity, the practice of double-blind peer review is the de-facto standard.
This paper touches on the holy grail of peer reviewing and aims to shed light on the performance of AI in reviewing for academic security conferences.
We investigate the predictability of reviewing outcomes by comparing the results obtained from human reviewers and machine-learning models.
arXiv Detail & Related papers (2023-09-11T13:51:40Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
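To illustrate the adaptive-testing idea above, the sketch below uses a generic two-parameter logistic IRT model with made-up item parameters (not the paper's procedure): after each response, the ability estimate is updated and the most informative remaining item is administered next.

```python
import numpy as np

# Two-parameter logistic (2PL) IRT model: probability of a correct response
# for ability `theta` on an item with discrimination `a` and difficulty `b`.
def p_correct(theta: float, a: float, b: float) -> float:
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta: float, a: float, b: float) -> float:
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def next_item(theta, items, answered) -> int:
    """Adaptive testing in one line: administer the unanswered item that is
    most informative at the current ability estimate."""
    candidates = [i for i in range(len(items)) if i not in answered]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

def update_theta(responses, items, grid=np.linspace(-4, 4, 161)) -> float:
    """Crude maximum-likelihood ability update on a grid;
    `responses` is a list of (item_index, answered_correctly) pairs."""
    def loglik(theta):
        return sum(np.log(p_correct(theta, *items[i])) if y
                   else np.log(1.0 - p_correct(theta, *items[i]))
                   for i, y in responses)
    return float(max(grid, key=loglik))

# Hypothetical item bank: (discrimination, difficulty) per item.
items = [(1.2, -1.0), (0.8, 0.0), (1.5, 0.5), (1.0, 1.5)]
print(next_item(theta=0.0, items=items, answered={0}))  # item 2 is most informative at theta = 0
```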
- Cross-functional Analysis of Generalisation in Behavioural Learning [4.0810783261728565]
We introduce BeLUGA, an analysis method for evaluating behavioural learning considering generalisation across dimensions of different levels.
An aggregate score measures generalisation to unseen functionalities (or overfitting).
arXiv Detail & Related papers (2023-05-22T11:54:19Z)
- RACCER: Towards Reachable and Certain Counterfactual Explanations for Reinforcement Learning [2.0341936392563063]
We propose RACCER, the first RL-specific approach to generating counterfactual explanations for the behavior of RL agents.
We use a tree search to find the most suitable counterfactuals based on the defined properties.
We evaluate RACCER in two tasks as well as conduct a user study to show that RL-specific counterfactuals help users better understand agents' behavior.
arXiv Detail & Related papers (2023-03-08T09:47:00Z)
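The summary above mentions a tree search over reachable states; a minimal, generic sketch of that pattern (not RACCER's actual objective or algorithm; the `step` and `best_action` callables are hypothetical) is a breadth-first search for the shortest action sequence leading to a state where the agent's choice changes:

```python
from collections import deque
from typing import Any, Callable, List, Optional, Tuple

def reachable_counterfactual(
    state: Any,
    actions: List[Any],
    step: Callable[[Any, Any], Any],    # transition model: (state, action) -> next state
    best_action: Callable[[Any], Any],  # the agent's chosen action in a state
    target_action: Any,
    max_depth: int = 3,
) -> Optional[Tuple[List[Any], Any]]:
    """Breadth-first search over short action sequences for a nearby,
    reachable state in which the agent would choose `target_action`.
    Returns (action sequence, counterfactual state) or None."""
    queue = deque([(state, [])])
    while queue:
        s, path = queue.popleft()
        if path and best_action(s) == target_action:
            return path, s
        if len(path) < max_depth:
            for a in actions:
                queue.append((step(s, a), path + [a]))
    return None
```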
- A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present a large-scale empirical study that, for the first time, enables benchmarking of confidence scoring functions.
The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation.
arXiv Detail & Related papers (2022-11-28T12:25:27Z)
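The "softmax response" baseline highlighted above is simply the maximum softmax probability used as a confidence score for ranking predictions and detecting likely failures; a minimal sketch with made-up logits:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msr_confidence(logits: np.ndarray) -> np.ndarray:
    """Maximum softmax response: the predicted class probability,
    used to rank predictions from most to least trustworthy."""
    return softmax(logits).max(axis=-1)

# Hypothetical batch of logits for a 3-class classifier.
logits = np.array([[4.0, 0.5, 0.1],    # confident prediction
                   [1.1, 1.0, 0.9]])   # near-uniform, likely failure
print(msr_confidence(logits))          # approx. [0.95, 0.37]
```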
- Evaluating Explainable Methods for Predictive Process Analytics: A Functionally-Grounded Approach [2.2448567386846916]
Predictive process analytics focuses on predicting the future states of running instances of a business process.
Current explainable machine learning methods, such as LIME and SHAP, can be used to interpret black box models.
We apply the proposed metrics to evaluate the performance of LIME and SHAP in interpreting process predictive models built on XGBoost.
arXiv Detail & Related papers (2020-12-08T05:05:19Z)
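For context on the kind of explanation being evaluated above, here is a minimal, generic example of computing SHAP attributions for an XGBoost classifier on synthetic data (this is not the paper's process-prediction setup, and LIME would be applied analogously):

```python
# Minimal, generic example of explaining an XGBoost model with SHAP
# (synthetic data; not the paper's process-prediction models).
import numpy as np
import shap
import xgboost

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                  # 5 synthetic features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # label depends on features 0 and 1

model = xgboost.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)

explainer = shap.TreeExplainer(model)   # fast, exact attributions for tree models
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions
print(np.abs(shap_values).mean(axis=0)) # global importance ranking of the 5 features
```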
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.