Adaptive Stress Testing Black-Box LLM Planners
- URL: http://arxiv.org/abs/2505.05665v2
- Date: Fri, 10 Oct 2025 19:42:23 GMT
- Title: Adaptive Stress Testing Black-Box LLM Planners
- Authors: Neeloy Chakraborty, John Pohovey, Melkior Ornik, Katherine Driggs-Campbell
- Abstract summary: Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks. But their tendency to hallucinate unsafe and undesired outputs poses risks. We argue that detecting such failures is necessary, especially in safety-critical scenarios.
- Score: 6.506759042895813
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. We argue that detecting such failures is necessary, especially in safety-critical scenarios. Existing methods for black-box models often detect hallucinations by identifying inconsistencies across multiple samples. Many of these approaches typically introduce prompt perturbations like randomizing detail order or generating adversarial inputs, with the intuition that a confident model should produce stable outputs. We first perform a manual case study showing that other forms of perturbations (e.g., adding noise, removing sensor details) cause LLMs to hallucinate in a multi-agent driving environment. We then propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios and prompts that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used at runtime to automatically generate prompts that influence model uncertainty, and to inform real-time trust assessments of an LLM. We further characterize LLMs deployed as planners in a single-agent lunar lander environment and in a multi-agent robot crowd navigation simulation. Overall, ours is one of the first hallucination intervention algorithms to pave a path towards rigorous characterization of black-box LLM planners.
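To make the search concrete, below is a minimal, hypothetical Python sketch (not the authors' released code) of AST-style prompt stress testing with MCTS as described in the abstract: each tree node holds a perturbed prompt, actions apply perturbations such as adding noise or removing sensor details, and the reward is a disagreement-based uncertainty proxy computed from repeated samples of a black-box planner. The `query_llm` stub, the perturbation set, and the uncertainty measure are assumptions chosen for illustration only.

```python
"""Illustrative sketch of adaptive stress testing over prompt perturbations
with Monte-Carlo tree search. All perturbations, the LLM stub, and the
uncertainty proxy are assumptions, not the paper's implementation."""
import math
import random

# Hypothetical perturbation actions over the prompt text.
PERTURBATIONS = {
    "add_noise":       lambda p: p + " [sensor reading corrupted by noise]",
    "remove_detail":   lambda p: "\n".join(p.splitlines()[:-1]) or p,
    "shuffle_details": lambda p: "\n".join(random.sample(p.splitlines(), len(p.splitlines()))),
}


def query_llm(prompt: str) -> str:
    """Stand-in for a black-box LLM planner call; replace with a real API call."""
    return random.choice(["brake", "keep_lane", "change_lane_left"])


def uncertainty(prompt: str, n_samples: int = 5) -> float:
    """Disagreement across repeated samples as a simple uncertainty proxy."""
    answers = [query_llm(prompt) for _ in range(n_samples)]
    most_common = max(set(answers), key=answers.count)
    return 1.0 - answers.count(most_common) / n_samples


class Node:
    """One node of the prompt-perturbation tree."""

    def __init__(self, prompt, parent=None, action=None, depth=0):
        self.prompt, self.parent, self.action, self.depth = prompt, parent, action, depth
        self.children, self.visits, self.value = [], 0, 0.0

    def uct(self, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)


def mcts_stress_test(base_prompt: str, iterations: int = 50, max_depth: int = 3):
    """Search for perturbation sequences that maximize planner uncertainty."""
    root = Node(base_prompt)
    for _ in range(iterations):
        # Selection: descend by UCT while nodes are fully expanded.
        node = root
        while node.children and len(node.children) == len(PERTURBATIONS):
            node = max(node.children, key=lambda c: c.uct())
        # Expansion: apply one untried perturbation (up to max_depth).
        untried = [a for a in PERTURBATIONS if a not in {c.action for c in node.children}]
        if untried and node.depth < max_depth:
            action = random.choice(untried)
            child = Node(PERTURBATIONS[action](node.prompt), node, action, node.depth + 1)
            node.children.append(child)
            node = child
        # Evaluation: reward is the uncertainty the perturbed prompt induces.
        reward = uncertainty(node.prompt)
        # Backpropagation.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    best = max(root.children, key=lambda c: c.value / max(c.visits, 1))
    return best.action, best.prompt


if __name__ == "__main__":
    scene = "Ego vehicle in lane 2.\nPedestrian ahead at 20 m.\nSpeed 30 km/h."
    action, prompt = mcts_stress_test(scene)
    print("Most destabilizing first perturbation:", action)
```

In this toy version the reward uses only sampling disagreement; the paper's formulation also rewards prompts that lead to unsafe outcomes (e.g., crashes), which would replace or augment the `uncertainty` score with an environment-level failure signal.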
Related papers
- Constitutional Black-Box Monitoring for Scheming in LLM Agents [1.4619913143519836]
We use language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors that detect scheming using only externally observable inputs and outputs. We find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization.
arXiv Detail & Related papers (2026-02-28T22:31:32Z) - Mitigating Hallucinations in Multimodal LLMs via Object-aware Preference Optimization [55.543583937522804]
Multimodal Large Language Models (MLLMs) emerge as a unified interface to address a multitude of tasks. Despite showcasing state-of-the-art results in many benchmarks, a long-standing issue is the tendency of MLLMs to hallucinate. In this paper, we address the problem of hallucinations as an alignment problem, seeking to steer the MLLM so that it prefers generating content without hallucinations.
arXiv Detail & Related papers (2025-08-27T18:02:04Z) - Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z) - Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models [0.0]
We propose Counterfactual Probing, a novel approach for detecting and mitigating hallucinations in large language models. Our method dynamically generates counterfactual statements that appear plausible but contain subtle factual errors, then evaluates the model's sensitivity to these perturbations.
arXiv Detail & Related papers (2025-08-03T17:29:48Z) - MIRAGE-Bench: LLM Agent is Hallucinating and Where to Find Them [52.764019220214344]
Hallucinations pose critical risks for large language model (LLM)-based agents. We present MIRAGE-Bench, the first unified benchmark for eliciting and evaluating hallucinations in interactive environments.
arXiv Detail & Related papers (2025-07-28T17:38:29Z) - Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks [0.5218155982819203]
Large Language Models (LLMs) are increasingly used as code assistants. This study examines a more direct threat: open-source LLMs generating vulnerable code when prompted.
arXiv Detail & Related papers (2025-07-14T08:36:26Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts. It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework. MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - Hidden in Plain Sight: Probing Implicit Reasoning in Multimodal Language Models [21.698247799954654]
Multimodal large language models (MLLMs) are increasingly deployed in open-ended, real-world environments. This paper presents a systematic analysis of how current MLLMs handle implicit reasoning scenarios. We find that models frequently fail to surface hidden issues, even when they possess the necessary perceptual and reasoning skills.
arXiv Detail & Related papers (2025-05-30T21:47:28Z) - Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation? [7.416552590139255]
We evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks. We study how model performance varies across tasks and languages. We find that performance varies across models but is consistent across prompts.
arXiv Detail & Related papers (2025-04-29T12:30:05Z) - Feature-Aware Malicious Output Detection and Mitigation [8.378272216429954]
We propose a feature-aware method for harmful response rejection (FMM). FMM detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism. Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques.
arXiv Detail & Related papers (2025-04-12T12:12:51Z) - Towards General Visual-Linguistic Face Forgery Detection(V2) [90.6600794602029]
Face manipulation techniques have achieved significant advances, presenting serious challenges to security and social trust. Recent works demonstrate that leveraging multimodal models can enhance the generalization and interpretability of face forgery detection. We propose Face Forgery Text Generator (FFTG), a novel annotation pipeline that generates accurate text descriptions by leveraging forgery masks for initial region and type identification.
arXiv Detail & Related papers (2025-02-28T04:15:36Z) - HuDEx: Integrating Hallucination Detection and Explainability for Enhancing the Reliability of LLM responses [0.12499537119440242]
This paper proposes an explanation-enhanced hallucination-detection model, coined HuDEx. The proposed model provides a novel approach to integrating detection with explanations, enabling both users and the LLM itself to understand and reduce errors.
arXiv Detail & Related papers (2025-02-12T04:17:02Z) - Large Models in Dialogue for Active Perception and Anomaly Detection [35.16837804526144]
We propose a framework to actively collect information and perform anomaly detection in novel scenes. Two deep learning models engage in a dialogue to actively control a drone, increasing perception and anomaly detection accuracy. In addition to information gathering, our approach is used for anomaly detection, and our results demonstrate the proposed method's effectiveness.
arXiv Detail & Related papers (2025-01-27T18:38:36Z) - Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions [60.31496362993982]
Large language models (LLMs) frequently generate confident yet inaccurate responses. We present a novel, test-time approach to detecting model hallucination through systematic analysis of information flow.
arXiv Detail & Related papers (2024-12-13T16:14:49Z) - LLMScan: Causal Scan for LLM Misbehavior Detection [12.411972858200594]
Large Language Models (LLMs) generate untruthful, biased and harmful responses. This work introduces LLMScan, an innovative monitoring technique based on causality analysis.
arXiv Detail & Related papers (2024-10-22T02:27:57Z) - Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z) - Real-Time Anomaly Detection and Reactive Planning with Large Language Models [18.57162998677491]
Foundation models, e.g., large language models (LLMs), trained on internet-scale data possess zero-shot capabilities.
We present a two-stage reasoning framework that incorporates the judgement regarding potential anomalies into a safe control framework.
This enables our monitor to improve the trustworthiness of dynamic robotic systems, such as quadrotors or autonomous vehicles.
arXiv Detail & Related papers (2024-07-11T17:59:22Z) - Unfamiliar Finetuning Examples Control How Language Models Hallucinate [75.03210107477157]
Large language models are known to hallucinate when faced with unfamiliar queries.
We find that unfamiliar examples in the models' finetuning data are crucial in shaping these errors.
Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations.
arXiv Detail & Related papers (2024-03-08T18:28:13Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations better reveal language models' comprehensive grasp of language and their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Can Autonomous Vehicles Identify, Recover From, and Adapt to
Distribution Shifts? [104.04999499189402]
Out-of-training-distribution (OOD) scenarios are a common challenge of learning agents at deployment.
We propose an uncertainty-aware planning method, called robust imitative planning (RIP).
Our method can detect and recover from some distribution shifts, reducing the overconfident and catastrophic extrapolations in OOD scenes.
We introduce an autonomous car novel-scene benchmark, CARNOVEL, to evaluate the robustness of driving agents to a suite of tasks with distribution shifts.
arXiv Detail & Related papers (2020-06-26T11:07:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.