Related papers: Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL

URL: http://arxiv.org/abs/2206.02039v2
Date: Tue, 7 Jun 2022 20:41:50 GMT
Title: Beyond Value: CHECKLIST for Testing Inferences in Planning-Based RL
Authors: Kin-Ho Lam, Delyar Tabatabai, Jed Irvine, Donald Bertucci, Anita Ruangrotsakun, Minsuk Kahng, Alan Fern
Abstract summary: Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. We consider testing RL agents that make decisions via online tree search using a learned transition model and value function. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game.
Score: 20.360392791376707
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CheckList testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CheckList approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study involving knowledgeable AI researchers using the approach to evaluate an agent trained to play a complex real-time strategy game. The results show the approach is effective in allowing users to identify previously-unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.
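The query-rule idea in the abstract can be illustrated with a small sketch: a query selects nodes of the agent's search tree, and a rule states what the agent's inference at those nodes is expected to satisfy. This is only an illustration under stated assumptions, not the paper's interface. The `Node` class, the `enemy_hp` and `own_hp` features, and the `check_rule` helper are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class Node:
    """One search-tree node: a predicted state and the agent's value estimate.

    Hypothetical structure; the paper's actual tree representation is not shown here.
    """
    state: Dict[str, float]            # made-up state features for illustration
    value: float                       # value estimate the agent assigned to this node
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def check_rule(root: Node,
               query: Callable[[Node], bool],
               rule: Callable[[Node], bool]) -> List[Node]:
    """Return every node selected by `query` that violates `rule`."""
    return [n for n in walk(root) if query(n) and not rule(n)]


# Toy tree: the second child predicts a won fight but carries a negative value.
root = Node({"enemy_hp": 10.0, "own_hp": 10.0}, 0.1, [
    Node({"enemy_hp": 0.0, "own_hp": 8.0}, 0.9),
    Node({"enemy_hp": 0.0, "own_hp": 5.0}, -0.4),
    Node({"enemy_hp": 6.0, "own_hp": 0.0}, -0.8),
])

# Expected invariance (assumed): if the enemy is destroyed and we survive,
# the value estimate should be positive.
violations = check_rule(
    root,
    query=lambda n: n.state["enemy_hp"] == 0.0 and n.state["own_hp"] > 0.0,
    rule=lambda n: n.value > 0.0,
)

for n in violations:
    print(n.state, n.value)
```

Each returned node is a candidate inference flaw for a human tester to inspect; an empty result means the expected invariance held on this tree.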

Related papers

RAVine: Reality-Aligned Evaluation for Agentic Search [7.4420114967110385]
RAVine is a Reality-Aligned eValuation framework for agentic LLMs with search. RAVine targets multi-point queries and long-form answers that better reflect user intents. We benchmark a series of models using RAVine and derive several insights.
arXiv Detail & Related papers (2025-07-22T16:08:12Z)
Hierarchical Reinforcement Learning for Temporal Abstraction of Listwise Recommendation [51.06031200728449]
We propose a novel framework called mccHRL to provide different levels of temporal abstraction on listwise recommendation. Within the hierarchical framework, the high-level agent studies the evolution of user perception, while the low-level agent produces the item selection policy. Results show that our method significantly improves performance compared with several well-known baselines.
arXiv Detail & Related papers (2024-09-11T17:01:06Z)
Token-Supervised Value Models for Enhancing Mathematical Problem-Solving Capabilities of Large Language Models [56.32800938317095]
Existing verifiers are sub-optimal for tree search techniques at test time. We propose token-supervised value models (TVMs). TVMs assign each token a probability that reflects the likelihood of reaching the correct final answer.
arXiv Detail & Related papers (2024-07-12T13:16:50Z)
Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning [53.241569810013836]
We propose a novel framework that utilizes large language models (LLMs) to identify effective feature generation rules. We use decision trees to convey this reasoning information, as they can be easily represented in natural language. OCTree consistently enhances the performance of various prediction models across diverse benchmarks.
arXiv Detail & Related papers (2024-06-12T08:31:34Z)
Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods [49.62131719441252]
Attribution methods compute importance scores for input features to explain the output predictions of deep models. In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill. We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it. By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
Unveiling the Sentinels: Assessing AI Performance in Cybersecurity Peer Review [4.081120388114928]
In the field of cybersecurity, the practice of double-blind peer review is the de-facto standard. This paper touches on the holy grail of peer reviewing and aims to shed light on the performance of AI in reviewing for academic security conferences. We investigate the predictability of reviewing outcomes by comparing the results obtained from human reviewers and machine-learning models.
arXiv Detail & Related papers (2023-09-11T13:51:40Z)
From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
Cross-functional Analysis of Generalisation in Behavioural Learning [4.0810783261728565]
We introduce BeLUGA, an analysis method for evaluating behavioural learning considering generalisation across dimensions of different levels. An aggregate score measures generalisation to unseen functionalities (or overfitting).
arXiv Detail & Related papers (2023-05-22T11:54:19Z)
RACCER: Towards Reachable and Certain Counterfactual Explanations for Reinforcement Learning [2.0341936392563063]
We propose RACCER, the first RL-specific approach to generating counterfactual explanations for the behavior of RL agents. We use a tree search to find the most suitable counterfactuals based on the defined properties. We evaluate RACCER in two tasks and conduct a user study to show that RL-specific counterfactuals help users better understand agents' behavior.
arXiv Detail & Related papers (2023-03-08T09:47:00Z)
A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present a large-scale empirical study that, for the first time, enables benchmarking of confidence scoring functions. The finding that a simple softmax response baseline is the best-performing method overall underlines the drastic shortcomings of current evaluation.
arXiv Detail & Related papers (2022-11-28T12:25:27Z)
Evaluating Explainable Methods for Predictive Process Analytics: A Functionally-Grounded Approach [2.2448567386846916]
Predictive process analytics focuses on predicting the future states of running instances of a business process. Current explainable machine learning methods, such as LIME and SHAP, can be used to interpret black box models. We apply the proposed metrics to evaluate the performance of LIME and SHAP in interpreting process predictive models built on XGBoost.
arXiv Detail & Related papers (2020-12-08T05:05:19Z)
Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education. Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding. We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.