WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
- URL: http://arxiv.org/abs/2510.01354v1
- Date: Wed, 01 Oct 2025 18:34:06 GMT
- Title: WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
- Authors: Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong,
- Abstract summary: We present the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents.<n>We construct datasets containing both malicious and benign samples.<n>We then systematize both text-based and image-based detection methods.
- Score: 34.909802797979324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.
Related papers
- WebSentinel: Detecting and Localizing Prompt Injection Attacks for Web Agents [45.87204751555924]
Prompt injection attacks manipulate webpage content to cause web agents to execute attacker-specified tasks instead of the user's intended ones.<n>Existing methods for detecting and localizing such attacks achieve limited effectiveness.<n>We propose WebSentinel, a two-step approach for detecting and localizing prompt injection attacks in webpages.
arXiv Detail & Related papers (2026-02-03T17:55:04Z) - SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models [11.304852987259041]
This paper introduces three novel contextually-aware attack scenarios that exploit domain-specific knowledge and semantic plausibility.<n>We present textbfSCOUT (Saliency-based Classification Of Untrusted Tokens), a novel defense framework that identifies backdoor triggers through token-level saliency analysis.
arXiv Detail & Related papers (2025-12-10T17:25:55Z) - Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively.<n>The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors.<n>We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z) - Can Indirect Prompt Injection Attacks Be Detected and Removed? [68.6543680065379]
We investigate the feasibility of detecting and removing indirect prompt injection attacks.<n>For detection, we assess the performance of existing LLMs and open-source detection models.<n>For removal, we evaluate two intuitive methods: (1) the segmentation removal method, which segments the injected document and removes parts containing injected instructions, and (2) the extraction removal method, which trains an extraction model to identify and remove injected instructions.
arXiv Detail & Related papers (2025-02-23T14:02:16Z) - AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries.
We show that ACPT can detect 7 state-of-the-art query-based attacks with $>99%$ detection rate within 5 shots.
We also show that ACPT is robust to 3 types of adaptive attacks.
arXiv Detail & Related papers (2024-08-04T09:53:50Z) - Towards a Practical Defense against Adversarial Attacks on Deep
Learning-based Malware Detectors via Randomized Smoothing [3.736916304884177]
We propose a practical defense against adversarial malware examples inspired by randomized smoothing.
In our work, instead of employing Gaussian or Laplace noise when randomizing inputs, we propose a randomized ablation-based smoothing scheme.
We have empirically evaluated the proposed ablation-based model against various state-of-the-art evasion attacks on the BODMAS dataset.
arXiv Detail & Related papers (2023-08-17T10:30:25Z) - Uncertainty-based Detection of Adversarial Attacks in Semantic
Segmentation [16.109860499330562]
We introduce an uncertainty-based approach for the detection of adversarial attacks in semantic segmentation.
We demonstrate the ability of our approach to detect perturbed images across multiple types of adversarial attacks.
arXiv Detail & Related papers (2023-05-22T08:36:35Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Zero-Query Transfer Attacks on Context-Aware Object Detectors [95.18656036716972]
Adversarial attacks perturb images such that a deep neural network produces incorrect classification results.
A promising approach to defend against adversarial attacks on natural multi-object scenes is to impose a context-consistency check.
We present the first approach for generating context-consistent adversarial attacks that can evade the context-consistency check.
arXiv Detail & Related papers (2022-03-29T04:33:06Z) - AntidoteRT: Run-time Detection and Correction of Poison Attacks on
Neural Networks [18.461079157949698]
backdoor poisoning attacks against image classification networks.
We propose lightweight automated detection and correction techniques against poisoning attacks.
Our technique outperforms existing defenses such as NeuralCleanse and STRIP on popular benchmarks.
arXiv Detail & Related papers (2022-01-31T23:42:32Z) - Anomaly Detection-Based Unknown Face Presentation Attack Detection [74.4918294453537]
Anomaly detection-based spoof attack detection is a recent development in face Presentation Attack Detection.
In this paper, we present a deep-learning solution for anomaly detection-based spoof attack detection.
The proposed approach benefits from the representation learning power of the CNNs and learns better features for fPAD task.
arXiv Detail & Related papers (2020-07-11T21:20:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.