Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach
- URL: http://arxiv.org/abs/2412.06684v2
- Date: Sat, 14 Dec 2024 11:06:37 GMT
- Title: Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach
- Authors: Weichao Xu, Huaxin Pei, Jingxuan Yang, Yuchen Shi, Yi Zhang, Qianchuan Zhao
- Abstract summary: This paper proposes an adaptable Large Language Model (LLM)-driven online testing framework to explore critical and diverse testing scenarios.
Specifically, we design a "generate-test-feedback" pipeline with templated prompt engineering to harness the world knowledge and reasoning abilities of LLMs.
- Score: 14.32199539218175
- License:
- Abstract: Recent advances in decision-making policies have led to significant progress in fields such as autonomous driving and robotics. However, testing these policies remains crucial with the existence of critical scenarios that may threaten their reliability. Despite ongoing research, challenges such as low testing efficiency and limited diversity persist due to the complexity of the decision-making policies and their environments. To address these challenges, this paper proposes an adaptable Large Language Model (LLM)-driven online testing framework to explore critical and diverse testing scenarios for decision-making policies. Specifically, we design a "generate-test-feedback" pipeline with templated prompt engineering to harness the world knowledge and reasoning abilities of LLMs. Additionally, a multi-scale scenario generation strategy is proposed to address the limitations of LLMs in making fine-grained adjustments, further enhancing testing efficiency. Finally, the proposed LLM-driven method is evaluated on five widely recognized benchmarks, and the experimental results demonstrate that our method significantly outperforms baseline methods in uncovering both critical and diverse scenarios. These findings suggest that LLM-driven methods hold significant promise for advancing the testing of decision-making policies.
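The "generate-test-feedback" pipeline described above lends itself to a short illustration. The following is a minimal sketch only, under assumed interfaces: `query_llm` (an LLM completion call), `run_scenario` (a simulator that scores how critical an executed scenario is), the prompt template, and the criticality threshold are all hypothetical, and the paper's multi-scale scenario generation strategy is not shown.

```python
# Minimal sketch of a "generate-test-feedback" loop for LLM-driven scenario testing.
# query_llm and run_scenario are hypothetical stand-ins for an LLM API call and a
# simulator that executes the policy under test in a candidate scenario.
import json
from typing import Callable, Dict, List

PROMPT_TEMPLATE = """You are testing a decision-making policy.
Scenario parameter space: {parameter_space}
Recently tested scenarios and outcomes: {history}
Propose ONE new scenario as JSON that is likely to be critical
(e.g. near-collision, missed goal) and distinct from those already tried."""

def generate_test_feedback(
    query_llm: Callable[[str], str],        # returns the LLM's text completion
    run_scenario: Callable[[Dict], float],  # returns a criticality score in [0, 1]
    parameter_space: Dict,
    n_rounds: int = 20,
    critical_threshold: float = 0.8,        # illustrative cut-off for "critical"
) -> List[Dict]:
    history: List[Dict] = []
    critical: List[Dict] = []
    for _ in range(n_rounds):
        # Generate: ask the LLM for a candidate scenario via a templated prompt.
        prompt = PROMPT_TEMPLATE.format(
            parameter_space=json.dumps(parameter_space),
            history=json.dumps(history[-5:]),  # keep the prompt short
        )
        scenario = json.loads(query_llm(prompt))
        # Test: execute the policy under test in the proposed scenario.
        score = run_scenario(scenario)
        # Feedback: record the outcome so the next prompt can reason over it.
        record = {"scenario": scenario, "criticality": score}
        history.append(record)
        if score >= critical_threshold:
            critical.append(record)
    return critical
```

The feedback step is what makes the loop an online testing procedure: each new prompt conditions on the outcomes of earlier scenarios, steering generation toward critical and diverse cases.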
Related papers
- Uncertainty Quantification and Causal Considerations for Off-Policy Decision Making [4.514386953429771]
Off-policy evaluation (OPE) seeks to assess the performance of a new policy using data collected under a different policy.
Existing OPE methodologies suffer from several limitations arising from statistical uncertainty as well as causal considerations.
We introduce the Marginal Ratio (MR) estimator, a novel OPE method that reduces variance by focusing on the marginal distribution of outcomes.
Next, we propose Conformal Off-Policy Prediction (COPP), a principled approach for uncertainty quantification in OPE.
Finally, we address causal unidentifiability in off-policy decision-making by developing novel bounds for sequential decision settings. (A generic importance-sampling sketch of the basic OPE setup appears after this list.)
arXiv Detail & Related papers (2025-02-09T20:05:19Z) - Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents [67.07177243654485]
This survey collects and analyzes the different threats faced by large language model (LLM)-based agents.
We identify six key features of LLM-based agents, based on which we summarize the current research progress.
We select four representative agents as case studies to analyze the risks they may face in practical use.
arXiv Detail & Related papers (2024-11-14T15:40:04Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs).
MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task.
We evaluate the effectiveness of MRP through comprehensive benchmarks.
arXiv Detail & Related papers (2024-06-17T16:14:11Z) - Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms [34.593772931446125]
This monograph explores various model-based and model-free approaches to constrained reinforcement learning within the context of average-reward Markov Decision Processes (MDPs).
The primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs.
arXiv Detail & Related papers (2024-06-17T12:46:02Z) - Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward [9.218557081971708]
Large Language Models (LLMs) have seen widespread applications across numerous fields.
Their limited interpretability raises concerns about their safe operation from multiple perspectives.
Recent research has started developing quality assurance methods for LLMs.
arXiv Detail & Related papers (2024-04-12T14:55:16Z) - Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
A promising approach to rectifying flaws in their outputs is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques (a minimal self-correction loop is sketched after this list).
arXiv Detail & Related papers (2023-08-06T18:38:52Z) - Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z) - An Empirical Comparison of Bias Reduction Methods on Real-World Problems in High-Stakes Policy Settings [13.037143215464132]
We investigate the performance of several methods that operate at different points in the machine learning pipeline across four real-world public policy and social good problems.
We find considerable variability and inconsistency in the ability of many of these methods to improve model fairness, but post-processing by choosing group-specific score thresholds consistently removes disparities.
arXiv Detail & Related papers (2021-05-13T17:33:28Z) - Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated by a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
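Several entries above concern off-policy evaluation (OPE), i.e. assessing a new policy from data logged under a different behaviour policy. As a point of reference, here is a minimal sketch of the textbook importance-sampling estimator; it is not the MR or COPP estimators of the cited paper, and the data below are toy values chosen for illustration.

```python
# Textbook importance-sampling estimate of a target policy's value from data
# logged under a behaviour policy: reweight each observed reward by the ratio
# of the two policies' probabilities for the logged action.
import numpy as np

def importance_sampling_ope(rewards, behaviour_probs, target_probs):
    """Arrays over logged samples: observed reward, and the probability each
    policy assigns to the action that was actually taken."""
    weights = np.asarray(target_probs) / np.asarray(behaviour_probs)
    return float(np.mean(weights * np.asarray(rewards)))

# Toy example: the target policy concentrates on the rewarding action,
# so its estimated value (~0.9) exceeds the behaviour policy's (~0.5).
rewards = np.array([1.0, 0.0, 1.0, 0.0])
behaviour_probs = np.array([0.5, 0.5, 0.5, 0.5])
target_probs = np.array([0.9, 0.1, 0.9, 0.1])
print(importance_sampling_ope(rewards, behaviour_probs, target_probs))  # 0.9
```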
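The self-correction survey above describes prompting an LLM to fix problems in its own output. A minimal sketch of such a loop follows; `query_llm`, the prompts, and the stop condition are illustrative assumptions rather than any method from the survey.

```python
# Minimal self-correction loop: the model drafts an answer, critiques it, and
# revises until it reports no remaining problems. query_llm is a hypothetical
# stand-in for an LLM completion call.
from typing import Callable

def self_correct(query_llm: Callable[[str], str], task: str, max_rounds: int = 3) -> str:
    answer = query_llm(f"Task: {task}\nGive your best answer.")
    for _ in range(max_rounds):
        critique = query_llm(
            f"Task: {task}\nProposed answer: {answer}\n"
            "List any factual or logical problems, or reply 'OK' if there are none."
        )
        if critique.strip().upper().startswith("OK"):
            break  # the model judges its own answer acceptable
        answer = query_llm(
            f"Task: {task}\nPrevious answer: {answer}\n"
            f"Problems found: {critique}\nProduce a corrected answer."
        )
    return answer
```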