Can Large Language Models Improve Phishing Defense? A Large-Scale Controlled Experiment on Warning Dialogue Explanations
- URL: http://arxiv.org/abs/2507.07916v1
- Date: Thu, 10 Jul 2025 16:54:05 GMT
- Title: Can Large Language Models Improve Phishing Defense? A Large-Scale Controlled Experiment on Warning Dialogue Explanations
- Authors: Federico Maria Cau, Giuseppe Desolda, Francesco Greco, Lucio Davide Spano, Luca Viganò
- Abstract summary: Phishing is a prominent risk in modern cybersecurity, often used to bypass technological defences by exploiting predictable human behaviour. Warning dialogues are a standard mitigation measure, but the lack of explanatory clarity and static content limits their effectiveness. We report on our research to assess the capacity of Large Language Models to generate clear, concise, and scalable explanations for phishing warnings.
- Score: 2.854118480747787
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Phishing has become a prominent risk in modern cybersecurity, often used to bypass technological defences by exploiting predictable human behaviour. Warning dialogues are a standard mitigation measure, but the lack of explanatory clarity and static content limits their effectiveness. In this paper, we report on our research to assess the capacity of Large Language Models (LLMs) to generate clear, concise, and scalable explanations for phishing warnings. We carried out a large-scale between-subjects user study (N = 750) to compare the influence of warning dialogues supplemented with manually generated explanations against those generated by two LLMs, Claude 3.5 Sonnet and Llama 3.3 70B. We investigated two explanatory styles (feature-based and counterfactual) for their effects on behavioural metrics (click-through rate) and perceptual outcomes (e.g., trust, risk, clarity). The results indicate that well-constructed LLM-generated explanations can equal or surpass manually crafted explanations in reducing susceptibility to phishing; Claude-generated warnings exhibited particularly robust performance. Feature-based explanations were more effective for genuine phishing attempts, whereas counterfactual explanations diminished false-positive rates. Other variables such as workload, gender, and prior familiarity with warning dialogues significantly moderated warning effectiveness. These results indicate that LLMs can be used to automatically build explanations for warning users against phishing, and that such solutions are scalable, adaptive, and consistent with human-centred values.
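The abstract does not disclose the prompts used to produce the two explanation styles. As a rough, hedged sketch of how feature-based and counterfactual warning explanations might be requested from an LLM such as Claude 3.5 Sonnet, the Python snippet below builds one prompt per style from hypothetical detector output; the feature names, prompt wording, and model identifier are illustrative assumptions, not the authors' materials.

```python
# Minimal sketch (not the authors' implementation): asking an LLM for the two
# explanation styles compared in the study. The feature list, prompt wording,
# and model ID below are illustrative assumptions.
import anthropic  # assumes the official Anthropic SDK and a configured API key

# Hypothetical signals produced by an upstream phishing detector for one email/URL.
suspicious_features = {
    "sender_domain": "paypa1-security.com",
    "url": "http://paypa1-security.com/login/verify",
    "mismatch": "display name says 'PayPal' but the domain is not paypal.com",
    "urgency_cue": "message threatens account suspension within 24 hours",
}

def build_prompt(style: str) -> str:
    """Return a warning-explanation prompt in either 'feature' or 'counterfactual' style."""
    facts = "\n".join(f"- {k}: {v}" for k, v in suspicious_features.items())
    if style == "feature":
        task = ("Write a short, plain-language phishing warning that explains, "
                "feature by feature, why this message is likely a phishing attempt.")
    else:  # counterfactual
        task = ("Write a short, plain-language phishing warning that explains what "
                "would have to be different (e.g. sender domain, link target) for "
                "this message to be considered legitimate.")
    return f"{task}\n\nDetected signals:\n{facts}\n\nKeep it under 80 words."

client = anthropic.Anthropic()
for style in ("feature", "counterfactual"):
    reply = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed model ID for 'Claude 3.5 Sonnet'
        max_tokens=200,
        messages=[{"role": "user", "content": build_prompt(style)}],
    )
    print(f"--- {style} explanation ---")
    print(reply.content[0].text)
```

A real deployment would feed the output of an actual phishing detector rather than hard-coded values, and would render the generated text inside the browser warning dialogue rather than printing it.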
Related papers
- ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [49.47193675702453]
Large Language Models (LLMs) have demonstrated remarkable generative capabilities. LLMs remain vulnerable to malicious instructions that can bypass safety constraints. We propose a reasoning-based safety alignment framework, ARMOR, that replaces the ad-hoc chain-of-thought reasoning process with a human-aligned, structured one.
arXiv Detail & Related papers (2025-07-14T09:05:54Z)
- LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users [50.18141341939909]
We describe a vulnerability in language models trained with user feedback. A single user can persistently alter LM knowledge and behavior. We show that this attack can be used to insert factual knowledge the model did not previously possess.
arXiv Detail & Related papers (2025-07-03T17:55:40Z)
- Evaluating Large Language Models for Phishing Detection, Self-Consistency, Faithfulness, and Explainability [0.0]
Large Language Models (LLMs) show a promising direction and potential for improving domain-specific phishing classification tasks. Can LLMs not only classify phishing emails accurately but also generate explanations that are reliably aligned with their predictions and internally self-consistent? We have fine-tuned transformer-based models, including BERT, Llama models, and Wizard, to improve domain relevance and make them more tailored to phishing-specific distinctions.
arXiv Detail & Related papers (2025-06-16T17:54:28Z)
- TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent [10.467098379826618]
We propose TrojanStego, a novel threat model in which an adversary fine-tunes an LLM to embed sensitive context information into natural-looking outputs via linguistic steganography. We introduce a taxonomy outlining risk factors for compromised LLMs, and use it to evaluate the risk profile of the threat. Experimental results show that compromised models reliably transmit 32-bit secrets with 87% accuracy on held-out prompts, reaching over 97% accuracy using majority voting across three generations.
arXiv Detail & Related papers (2025-05-26T15:20:51Z)
- Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- "Explain, Don't Just Warn!" -- A Real-Time Framework for Generating Phishing Warnings with Contextual Cues [2.6818118216403497]
Anti-phishing tools typically display generic warnings that offer users little explanation of why a website is considered malicious. We present PhishXplain, a real-time explainable phishing warning system designed to augment existing detection mechanisms.
arXiv Detail & Related papers (2025-05-11T04:16:16Z)
- EXPLICATE: Enhancing Phishing Detection through Explainable AI and LLM-Powered Interpretability [44.2907457629342]
EXPLICATE is a framework that enhances phishing detection through a three-component architecture. It is on par with existing deep learning techniques but has better explainability. It addresses the critical divide between automated AI and user trust in phishing detection systems.
arXiv Detail & Related papers (2025-03-22T23:37:35Z)
- Turning Logic Against Itself : Probing Model Defenses Through Contrastive Questions [51.51850981481236]
We introduce POATE, a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses.
arXiv Detail & Related papers (2025-01-03T15:40:03Z)
- When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations [58.27927090394458]
Large Language Models (LLMs) are known to be vulnerable to backdoor attacks. In this paper, we examine backdoor attacks through the novel lens of natural language explanations. Our results show that backdoored models produce coherent explanations for clean inputs but diverse and logically flawed explanations for poisoned data.
arXiv Detail & Related papers (2024-11-19T18:11:36Z)
- APOLLO: A GPT-based tool to detect phishing emails and generate explanations that warn users [2.3618982787621]
Large Language Models (LLMs) offer significant promise for text processing in various domains.
We present APOLLO, a tool based on OpenAI's GPT-4o that detects phishing emails and generates explanation messages.
We also conducted a study with 20 participants, comparing four different explanations presented as phishing warnings.
arXiv Detail & Related papers (2024-10-10T14:53:39Z)