On the Trade-Off Between Transparency and Security in Adversarial Machine Learning
- URL: http://arxiv.org/abs/2511.11842v1
- Date: Fri, 14 Nov 2025 20:05:50 GMT
- Title: On the Trade-Off Between Transparency and Security in Adversarial Machine Learning
- Authors: Lucas Fenaux, Christopher Srinivasa, Florian Kerschbaum
- Abstract summary: We investigate the strategic effect of transparency for agents through the lens of transferable adversarial example attacks. In transferable adversarial example attacks, attackers maliciously perturb their inputs using surrogate models to fool a defender's target model. We find that attackers are more successful when they match the defender's decision.
- Score: 19.827079641936837
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transparency and security are both central to Responsible AI, but they may conflict in adversarial settings. We investigate the strategic effect of transparency for agents through the lens of transferable adversarial example attacks. In transferable adversarial example attacks, attackers maliciously perturb their inputs using surrogate models to fool a defender's target model. These models can be defended or undefended, with both players having to decide which to use. Using a large-scale empirical evaluation of nine attacks across 181 models, we find that attackers are more successful when they match the defender's decision; hence, obscurity could be beneficial to the defender. With game theory, we analyze this trade-off between transparency and security by modeling this problem as both a Nash game and a Stackelberg game, and comparing the expected outcomes. Our analysis confirms that only knowing whether a defender's model is defended or not can sometimes be enough to damage its security. This result serves as an indicator of the general trade-off between transparency and security, suggesting that transparency in AI systems can be at odds with security. Beyond adversarial machine learning, our work illustrates how game-theoretic reasoning can uncover conflicts between transparency and security.
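The abstract's central claim can be illustrated with a minimal game-theoretic sketch. The 2x2 matrix below uses purely illustrative attack-success rates (not values from the paper, which evaluates nine attacks across 181 models): the attacker succeeds more often when its surrogate choice matches the defender's defended/undefended choice. Under transparency, the defender commits to a pure choice the attacker observes and best-responds to (a pure-commitment Stackelberg outcome); under obscurity, the defender mixes so the attacker cannot exploit the match (the mixed Nash value of the zero-sum game).

```python
# Hypothetical 2x2 sketch of the transparency/security trade-off.
# Rows: defender deploys an undefended (0) or defended (1) model.
# Columns: attacker's surrogate is undefended (0) or defended (1).
# Entries are attack success rates (attacker payoff); the numbers
# are illustrative assumptions, not results from the paper.
S = [[0.7, 0.3],   # attacker succeeds most when choices match
     [0.2, 0.6]]

def stackelberg_success(S):
    """Transparent setting: the attacker observes the defender's pure
    choice and best-responds, so the defender picks the row whose
    best response hurts least (min over rows of the row max)."""
    return min(max(row) for row in S)

def nash_success(S):
    """Obscure setting: the defender mixes over rows so the attacker
    is indifferent between columns (mixed Nash of the zero-sum game)."""
    (a, b), (c, d) = S
    p = (d - c) / ((a - c) + (d - b))  # P(defender plays row 0)
    return p * a + (1 - p) * c         # attacker's indifferent payoff

print(f"attack success under transparency (Stackelberg): {stackelberg_success(S):.2f}")
print(f"attack success under obscurity (mixed Nash):     {nash_success(S):.2f}")
```

With these illustrative numbers, revealing the defended/undefended decision raises attack success from 0.45 to 0.60, matching the paper's qualitative finding that knowing whether a model is defended can by itself damage security.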
Related papers
- Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models [64.47869632167284]
Conventional language model (LM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch: attackers overfit to obsolete defenses, while defenders perpetually lag behind emerging threats. We propose Self-RedTeam, an online self-play reinforcement learning algorithm where an attacker and a defender agent co-evolve through continuous interaction.
arXiv Detail & Related papers (2025-06-09T06:35:12Z) - Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game [28.33029508522531]
Malicious attackers induce large models to jailbreak and generate content containing illegal or privacy-invasive information.
Large models counter such attacks using techniques such as safety alignment.
We propose a multi-agent attacker-disguiser game approach to achieve a weak defense mechanism that allows the large model to both safely reply to the attacker and hide the defense intent.
arXiv Detail & Related papers (2024-04-03T07:43:11Z) - Counter-Samples: A Stateless Strategy to Neutralize Black Box Adversarial Attacks [2.9815109163161204]
Our paper presents a novel defence against black box attacks, where attackers use the victim model as an oracle to craft their adversarial examples.
Unlike traditional preprocessing defences that rely on sanitizing input samples, our strategy counters the attack process itself.
We demonstrate that our approach is remarkably effective against state-of-the-art black box attacks and outperforms existing defences for both the CIFAR-10 and ImageNet datasets.
arXiv Detail & Related papers (2024-03-14T10:59:54Z) - On the Difficulty of Defending Contrastive Learning against Backdoor Attacks [58.824074124014224]
We show how contrastive backdoor attacks operate through distinctive mechanisms.
Our findings highlight the need for defenses tailored to the specificities of contrastive backdoor attacks.
arXiv Detail & Related papers (2023-12-14T15:54:52Z) - The Best Defense is a Good Offense: Adversarial Augmentation against Adversarial Attacks [91.56314751983133]
$A5$ is a framework to craft a defensive perturbation that guarantees any attack against the input at hand will fail.
We show effective on-the-fly defensive augmentation with a robustifier network that ignores the ground truth label.
We also show how to apply $A5$ to create certifiably robust physical objects.
arXiv Detail & Related papers (2023-05-23T16:07:58Z) - Adversarial Machine Learning and Defense Game for NextG Signal Classification with Deep Learning [1.1726528038065764]
NextG systems can employ deep neural networks (DNNs) for various tasks such as user equipment identification, physical layer authentication, and detection of incumbent users.
This paper presents a game-theoretic framework to study the interactions of attack and defense for deep learning-based NextG signal classification.
arXiv Detail & Related papers (2022-12-22T15:13:03Z) - Simulation of Attacker Defender Interaction in a Noisy Security Game [1.967117164081002]
We introduce a security game framework that simulates interplay between attackers and defenders in a noisy environment.
We demonstrate the importance of making the right assumptions about attackers, as different assumptions lead to significantly different outcomes.
There is a measurable trade-off between false positives and true positives in terms of attacker outcomes.
arXiv Detail & Related papers (2022-12-08T14:18:44Z) - Fixed Points in Cyber Space: Rethinking Optimal Evasion Attacks in the Age of AI-NIDS [70.60975663021952]
We study blackbox adversarial attacks on network classifiers.
We argue that attacker-defender fixed points are themselves general-sum games with complex phase transitions.
We show that a continual learning approach is required to study attacker-defender dynamics.
arXiv Detail & Related papers (2021-11-23T23:42:16Z) - Unrestricted Adversarial Attacks on ImageNet Competition [70.8952435964555]
Unrestricted adversarial attacks are a popular and practical direction but have not been studied thoroughly.
We organize this competition with the purpose of exploring more effective unrestricted adversarial attack algorithms.
arXiv Detail & Related papers (2021-10-17T04:27:15Z) - The Feasibility and Inevitability of Stealth Attacks [63.14766152741211]
We study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence systems.
In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself.
arXiv Detail & Related papers (2021-06-26T10:50:07Z) - Adversarial Classification of the Attacks on Smart Grids Using Game Theory and Deep Learning [27.69899235394942]
This paper proposes a game-theoretic approach to evaluate the variations caused by an attacker on the power measurements.
A zero-sum game is used to model the interactions between the attacker and defender.
arXiv Detail & Related papers (2021-06-06T18:43:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.