Related papers: SoK: Unintended Interactions among Machine Learning Defenses and Risks

Related papers

From Defender to Devil? Unintended Risk Interactions Induced by LLM Defenses [18.096213847353965]
Large Language Models (LLMs) have shown remarkable performance across various applications, but their deployment in sensitive domains raises significant concerns.<n>We take the first step in investigating unintended interactions caused by defenses in LLMs, focusing on the complex interplay between safety, fairness, and privacy.<n>We propose CrossRiskEval, a comprehensive evaluation framework to assess whether deploying a defense targeting one risk inadvertently affects others.
arXiv Detail & Related papers (2025-10-09T09:00:00Z)
The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration [72.33801123508145]
Large language models (LLMs) are integral to multi-agent systems.<n>Privacy risks emerge that extend beyond memorization, direct inference, or single-turn evaluations.<n>In particular, seemingly innocuous responses, when composed across interactions, can cumulatively enable adversaries to recover sensitive information.
arXiv Detail & Related papers (2025-09-16T16:57:25Z)
A Systematic Survey of Model Extraction Attacks and Defenses: State-of-the-Art and Perspectives [65.3369988566853]
Recent studies have demonstrated that adversaries can replicate a target model's functionality.<n>Model Extraction Attacks pose threats to intellectual property, privacy, and system security.<n>We propose a novel taxonomy that classifies MEAs according to attack mechanisms, defense approaches, and computing environments.
arXiv Detail & Related papers (2025-08-20T19:49:59Z)
Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers [61.57691030102618]
We propose a novel jailbreaking method, Paper Summary Attack (llmnamePSA)<n>It synthesizes content from either attack-focused or defense-focused LLM safety paper to construct an adversarial prompt template.<n>Experiments show significant vulnerabilities not only in base LLMs, but also in state-of-the-art reasoning model like Deepseek-R1.
arXiv Detail & Related papers (2025-07-17T18:33:50Z)
Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition.<n>We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts.<n>We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks [104.50239783909063]
Multimodal large language models with Retrieval Augmented Generation (RAG) have significantly advanced tasks such as multimodal question answering.<n>This reliance on external knowledge poses a critical yet underexplored safety risk: knowledge poisoning attacks.<n>We propose MM-PoisonRAG, the first framework to systematically design knowledge poisoning in multimodal RAG.
arXiv Detail & Related papers (2025-02-25T04:23:59Z)
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation is a novel multi-turn jailbreak framework. It reformulates harmful queries into benign reasoning tasks. We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z)
Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks. We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance. Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
arXiv Detail & Related papers (2025-01-30T18:02:15Z)
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense [56.32083100401117]
We investigate why Vision Large Language Models (VLLMs) are prone to jailbreak attacks. We then make a key observation: existing defense mechanisms suffer from an textbfover-prudence problem. We find that the two representative evaluation methods for jailbreak often exhibit chance agreement.
arXiv Detail & Related papers (2024-11-13T07:57:19Z)
Risks and NLP Design: A Case Study on Procedural Document QA [52.557503571760215]
We argue that clearer assessments of risks and harms to users will be possible when we specialize the analysis to more concrete applications and their plausible users. We conduct a risk-oriented error analysis that could then inform the design of a future system to be deployed with lower risk of harm and better performance.
arXiv Detail & Related papers (2024-08-16T17:23:43Z)
Breach By A Thousand Leaks: Unsafe Information Leakage in `Safe' AI Responses [42.136793654338106]
We introduce a new safety evaluation framework based on impermissible information leakage of model outputs. We show that to ensure safety against inferential adversaries, defense mechanisms must ensure information censorship.
arXiv Detail & Related papers (2024-07-02T16:19:25Z)
Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [65.77763092833348]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents also introduce novel vulnerabilities that demand careful consideration for safety. This paper conducts a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures.
arXiv Detail & Related papers (2024-02-06T18:54:07Z)
A Security Risk Taxonomy for Prompt-Based Interaction With Large Language Models [5.077431021127288]
This paper addresses a gap in current research by focusing on security risks posed by large language models (LLMs) Our work proposes a taxonomy of security risks along the user-model communication pipeline and categorizes the attacks by target and attack type alongside the commonly used confidentiality, integrity, and availability (CIA) triad.
arXiv Detail & Related papers (2023-11-19T20:22:05Z)
Measuring Equality in Machine Learning Security Defenses: A Case Study in Speech Recognition [56.69875958980474]
This work considers approaches to defending learned systems and how security defenses result in performance inequities across different sub-populations. We find that many methods that have been proposed can cause direct harm, like false rejection and unequal benefits from robustness training. We present a comparison of equality between two rejection-based defenses: randomized smoothing and neural rejection, finding randomized smoothing more equitable due to the sampling mechanism for minority groups.
arXiv Detail & Related papers (2023-02-17T16:19:26Z)
Conflicting Interactions Among Protections Mechanisms for Machine Learning Models [15.047412609389983]
ML models have become targets for various attacks. Research at the intersection of security and privacy, and ML has flourished. A solution that is optimal for a specific concern may interact negatively with solutions intended to address other concerns.
arXiv Detail & Related papers (2022-07-05T12:18:06Z)
Towards A Critical Evaluation of Robustness for Deep Learning Backdoor Countermeasures [13.56551253289911]
We critically examine the robustness of existing backdoor countermeasures with an initial focus on three influential model-inspection ones. Although the three countermeasures claim that they work well under their respective threat models, they have inherent unexplored non-robust cases. This work highlights the necessity of thoroughly evaluating the robustness of backdoor countermeasures to avoid their misleading security implications.
arXiv Detail & Related papers (2022-04-13T09:50:17Z)
Where Did You Learn That From? Surprising Effectiveness of Membership Inference Attacks Against Temporally Correlated Data in Deep Reinforcement Learning [114.9857000195174]
A major challenge to widespread industrial adoption of deep reinforcement learning is the potential vulnerability to privacy breaches. We propose an adversarial attack framework tailored for testing the vulnerability of deep reinforcement learning algorithms to membership inference attacks.
arXiv Detail & Related papers (2021-09-08T23:44:57Z)
Adversarial Visual Robustness by Causal Intervention [56.766342028800445]
Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. We provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning.
arXiv Detail & Related papers (2021-06-17T14:23:54Z)
Adversarially Robust Learning with Unknown Perturbation Sets [37.13850246542325]
We study the problem of learning predictors that are robust to adversarial examples with respect to an unknown perturbation set. We obtain upper bounds on the sample complexity and upper and lower bounds on the number of required interactions, or number of successful attacks.
arXiv Detail & Related papers (2021-02-03T17:01:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.