Related papers: Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach

URL: http://arxiv.org/abs/2412.06684v2
Date: Sat, 14 Dec 2024 11:06:37 GMT
Title: Exploring Critical Testing Scenarios for Decision-Making Policies: An LLM Approach
Authors: Weichao Xu, Huaxin Pei, Jingxuan Yang, Yuchen Shi, Yi Zhang, Qianchuan Zhao
Abstract summary: This paper proposes an adaptable Large Language Model (LLM)-driven online testing framework to explore critical and diverse testing scenarios. Specifically, we design a "generate-test-feedback" pipeline with templated prompt engineering to harness the world knowledge and reasoning abilities of LLMs.
Score: 14.32199539218175
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in decision-making policies have led to significant progress in fields such as autonomous driving and robotics. However, testing these policies remains crucial with the existence of critical scenarios that may threaten their reliability. Despite ongoing research, challenges such as low testing efficiency and limited diversity persist due to the complexity of the decision-making policies and their environments. To address these challenges, this paper proposes an adaptable Large Language Model (LLM)-driven online testing framework to explore critical and diverse testing scenarios for decision-making policies. Specifically, we design a "generate-test-feedback" pipeline with templated prompt engineering to harness the world knowledge and reasoning abilities of LLMs. Additionally, a multi-scale scenario generation strategy is proposed to address the limitations of LLMs in making fine-grained adjustments, further enhancing testing efficiency. Finally, the proposed LLM-driven method is evaluated on five widely recognized benchmarks, and the experimental results demonstrate that our method significantly outperforms baseline methods in uncovering both critical and diverse scenarios. These findings suggest that LLM-driven methods hold significant promise for advancing the testing of decision-making policies.

Related papers

Enhancing Decision-Making of Large Language Models via Actor-Critic [28.870961806283425]
Large Language Models (LLMs) have achieved remarkable advancements in natural language processing tasks. Existing methods either rely on short-term auto-regressive action generation or face limitations in accurately simulating rollouts and assessing outcomes. This paper introduces a novel LLM-based Actor-Critic framework, termed LAC, that effectively improves LLM policies with long-term action evaluations.
arXiv Detail & Related papers (2025-06-04T14:58:27Z)
Robustness in Large Language Models: A Survey of Mitigation Strategies and Evaluation Metrics [0.7481505949203433]
Large Language Models (LLMs) have emerged as a promising cornerstone for the development of natural language processing (NLP) and artificial intelligence (AI). This survey provides a comprehensive overview of current studies in this area.
arXiv Detail & Related papers (2025-05-24T11:50:52Z)
Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities [5.0778942095543576]
This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of Large Language Models. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment.
arXiv Detail & Related papers (2025-05-19T14:50:44Z)
Reinforcement Learning with Continuous Actions Under Unmeasured Confounding [14.510042451844766]
This paper addresses the challenge of offline policy learning in reinforcement learning with continuous action spaces. We develop a minimax estimator and introduce a policy-gradient-based algorithm to identify the in-class optimal policy. We provide theoretical results regarding the consistency, finite-sample error bound, and regret bound of the resulting optimal policy.
arXiv Detail & Related papers (2025-05-01T04:55:29Z)
LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z)
Navigating the Risks: A Survey of Security, Privacy, and Ethics Threats in LLM-Based Agents [67.07177243654485]
This survey collects and analyzes the different threats faced by large language model-based agents. We identify six key features of LLM-based agents, based on which we summarize the current research progress. We select four representative agents as case studies to analyze the risks they may face in practical use.
arXiv Detail & Related papers (2024-11-14T15:40:04Z)
A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions [15.350973327319418]
Large language models (LLMs) are increasingly integrated into a wide range of everyday applications. This raises concerns about the replicability and generalizability of insights gained from research on LLM behavior. We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3-8B, and Llama 3-70B, on the chain-of-thought, EmotionPrompting, ExpertPrompting, Sandbagging, as well as Re-Reading prompt engineering techniques.
arXiv Detail & Related papers (2024-09-30T14:00:34Z)
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a critical component of machine learning (ML) applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks, and identify the most promising approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark MR-Ben that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs). MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task. We evaluate the effectiveness of MRP through comprehensive benchmarks.
arXiv Detail & Related papers (2024-06-17T16:14:11Z)
Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms [34.593772931446125]
This monograph focuses on the exploration of various model-based and model-free approaches for constrained reinforcement learning within the context of average-reward Markov Decision Processes (MDPs). The primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs.
arXiv Detail & Related papers (2024-06-17T12:46:02Z)
Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward [9.218557081971708]
Large Language Models (LLMs) have seen widespread applications across numerous fields. Their limited interpretability poses concerns regarding their safe operations from multiple aspects. Recent research has started developing quality assurance methods for LLMs.
arXiv Detail & Related papers (2024-04-12T14:55:16Z)
K-Level Reasoning: Establishing Higher Order Beliefs in Large Language Models for Strategic Reasoning [76.3114831562989]
Strategic reasoning requires Large Language Model (LLM) agents to adapt their strategies dynamically in multi-agent environments. We propose a novel framework: "K-Level Reasoning with Large Language Models (K-R)".
arXiv Detail & Related papers (2024-02-02T16:07:05Z)
Put Your Money Where Your Mouth Is: Evaluating Strategic Planning and Execution of LLM Agents in an Auction Arena [25.865825113847404]
We introduce AucArena, a novel evaluation suite that simulates auctions. We conduct controlled experiments using state-of-the-art Large Language Models (LLMs) to power bidding agents to benchmark their planning and execution skills.
arXiv Detail & Related papers (2023-10-09T14:22:09Z)
Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study the offline reinforcement learning (RL) in the face of unmeasured confounders. We propose various policy learning methods with the finite-sample suboptimality guarantee of finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z)
An Empirical Comparison of Bias Reduction Methods on Real-World Problems in High-Stakes Policy Settings [13.037143215464132]
We investigate the performance of several methods that operate at different points in the machine learning pipeline across four real-world public policy and social good problems. We find a wide degree of variability and inconsistency in the ability of many of these methods to improve model fairness, but post-processing by choosing group-specific score thresholds consistently removes disparities.
arXiv Detail & Related papers (2021-05-13T17:33:28Z)
Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation. The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles. We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)

This list is automatically generated from the titles and abstracts of the papers on this site.