Access Controls Will Solve the Dual-Use Dilemma
- URL: http://arxiv.org/abs/2505.09341v3
- Date: Mon, 14 Jul 2025 06:49:24 GMT
- Title: Access Controls Will Solve the Dual-Use Dilemma
- Authors: Evžen Wybitul
- Abstract summary: It is unclear whether to answer dual-use requests, since the same query could be either harmless or harmful depending on who made it and why. To make better decisions, such systems would need to examine requests' real-world context. We propose a conceptual framework based on access controls where only verified users can access dual-use outputs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI safety systems face the dual-use dilemma. It is unclear whether to answer dual-use requests, since the same query could be either harmless or harmful depending on who made it and why. To make better decisions, such systems would need to examine requests' real-world context, but currently, they lack access to this information. Instead, they sometimes end up making arbitrary choices that result in refusing legitimate queries and allowing harmful ones, which hurts both utility and safety. To address this, we propose a conceptual framework based on access controls where only verified users can access dual-use outputs. We describe the framework's components, analyse its feasibility, and explain how it addresses both over-refusals and under-refusals. While only a high-level proposal, our work takes the first step toward giving model providers more granular tools for managing dual-use content. Such tools would enable users to access more capabilities without sacrificing safety, and offer regulators new options for targeted policies.
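The paper is an explicitly high-level proposal and gives no implementation. As a minimal sketch of the gating logic the abstract describes, the hypothetical Python below routes each request through a three-way verdict and releases dual-use outputs only to users holding a matching verified credential. All names, the toy classifier, and the credential scheme are assumptions for illustration, not the authors' design.

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"        # harmless for any requester
    DENY = "deny"          # harmful for any requester
    DUAL_USE = "dual_use"  # depends on who is asking and why

@dataclass
class User:
    user_id: str
    # Hypothetical capability credentials, e.g. institution-issued attestations.
    verified_for: set[str] = field(default_factory=set)

def run_model(prompt: str) -> str:
    # Stand-in for the actual model call.
    return f"[model response to: {prompt!r}]"

def classify_request(prompt: str) -> tuple[Verdict, str | None]:
    """Toy stand-in for a safety classifier. For dual-use requests it also
    returns the capability tag a user must be verified for."""
    text = prompt.lower()
    if "pathogen" in text:
        return Verdict.DUAL_USE, "virology"
    if "bomb" in text:
        return Verdict.DENY, None
    return Verdict.ALLOW, None

def answer(prompt: str, user: User) -> str:
    verdict, tag = classify_request(prompt)
    if verdict is Verdict.ALLOW:
        return run_model(prompt)
    if verdict is Verdict.DENY:
        return "Refused."
    # Dual-use: gate on verified credentials instead of refusing
    # outright (over-refusal) or answering blindly (under-refusal).
    if tag in user.verified_for:
        return run_model(prompt)
    return f"Refused: requires verification for '{tag}'."
```

Under this framing, the arbitrary choice the abstract criticises is replaced by a lookup against the requester's verified context.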
Related papers
- Uncertainty-Aware GUI Agent: Adaptive Perception through Component Recommendation and Human-in-the-Loop Refinement [11.63498742723335]
We present RecAgent, an uncertainty-aware agent that addresses perceptual and decision uncertainty through adaptive perception. To reduce perceptual uncertainty, RecAgent employs a component recommendation mechanism that identifies and focuses on the most relevant UI elements. For decision uncertainty, it uses an interactive module to request user feedback in ambiguous situations, enabling intent-aware decisions. A hedged sketch of this confident-or-ask pattern follows.
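The abstract does not specify the mechanism, so the following is only a guess at the shape of the human-in-the-loop fallback: act when the top component's score clears a threshold, otherwise ask. The scoring source and threshold are hypothetical.

```python
def select_component(scores: dict[str, float], tau: float = 0.8) -> str | None:
    """Act on the top-ranked UI component when confident; otherwise ask
    the user. `scores` maps component ids to relevance scores produced
    by a hypothetical recommender."""
    best, score = max(scores.items(), key=lambda kv: kv[1])
    if score >= tau:
        return best  # low decision uncertainty: act autonomously
    reply = input(f"Unsure. Did you mean '{best}'? [y/n] ")
    return best if reply.strip().lower() == "y" else None
```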
arXiv Detail & Related papers (2025-08-06T02:38:02Z) - Beyond Release: Access Considerations for Generative AI Systems [33.117342870212156]
Generative AI release decisions determine whether system components are made available, but release does not address many other elements that change how users and stakeholders are able to engage with a system. Access to system components informs potential risks and benefits. The resulting access framework better encompasses the landscape and risk-benefit tradeoffs of system releases, informing release decisions, research, and policy.
arXiv Detail & Related papers (2025-02-23T20:06:12Z) - Online Clustering of Dueling Bandits [59.09590979404303]
We introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB), which models user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB), which uses a neural network to model complex, non-linear user reward functions.
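As a rough illustration of the clustering idea behind COLDB: group users whose estimated linear preference parameters are close, then share data within a group. The papers' actual algorithms use confidence-set-based graph updates; the fixed threshold and union-find merge below are simplifying assumptions.

```python
import numpy as np

def cluster_users(thetas: dict[str, np.ndarray], tol: float = 0.5) -> list[list[str]]:
    """Group users whose estimated linear reward parameters lie within
    `tol` of each other. `thetas` maps each user to the current estimate
    of their preference vector (e.g. fit from pairwise preference feedback)."""
    users = list(thetas)
    parent = {u: u for u in users}

    def find(u: str) -> str:
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path compression
            u = parent[u]
        return u

    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if np.linalg.norm(thetas[u] - thetas[v]) <= tol:
                parent[find(u)] = find(v)  # merge the two clusters

    groups: dict[str, list[str]] = {}
    for u in users:
        groups.setdefault(find(u), []).append(u)
    return list(groups.values())
```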
arXiv Detail & Related papers (2025-02-04T07:55:41Z) - AlignGuard: Scalable Safety Alignment for Text-to-Image Generation [68.07258248467309]
Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. In this work, we introduce AlignGuard, a method for safety alignment of T2I models.
arXiv Detail & Related papers (2024-12-13T18:59:52Z) - Usage Governance Advisor: From Intent to AI Governance [4.49852442764084]
Evaluating the safety of AI systems is a pressing concern for organizations deploying them. We present Usage Governance Advisor, which creates semi-structured governance information.
arXiv Detail & Related papers (2024-12-02T20:36:41Z) - Self-Defense: Optimal QIF Solutions and Application to Website Fingerprinting [8.227044921274494]
Quantitative Information Flow (QIF) provides a robust information-theoretical framework for designing secure systems with minimal information leakage.
We propose optimal solutions for constructing a new row in a known, unmodifiable information-theoretic channel, aiming to minimize leakage.
We apply our approach to the problem of website fingerprinting defense, considering a scenario where a site administrator can modify their own site but not others.
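To make the objective concrete: in QIF, min-entropy leakage compares the adversary's posterior Bayes vulnerability to the prior. The paper derives optimal rows analytically; the sketch below instead brute-force scores candidate rows under an assumed uniform prior, which is enough to show what "minimizing the leakage" means.

```python
import numpy as np

def bayes_vulnerability(prior: np.ndarray, C: np.ndarray) -> float:
    """Posterior Bayes vulnerability: V(pi, C) = sum_y max_x pi[x] * C[x, y]."""
    return float((prior[:, None] * C).max(axis=0).sum())

def best_new_row(C: np.ndarray, candidates: list[np.ndarray]) -> np.ndarray:
    """Among candidate rows (distributions over observables), pick the one
    whose addition to the fixed channel C minimizes min-entropy leakage,
    assuming a uniform prior over the enlarged secret set."""
    n = C.shape[0] + 1
    prior = np.full(n, 1.0 / n)

    def leakage(row: np.ndarray) -> float:
        C_new = np.vstack([C, row])
        # L = log2(V_posterior / V_prior), with V_prior = max_x pi[x].
        return np.log2(bayes_vulnerability(prior, C_new) / prior.max())

    return min(candidates, key=leakage)
```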
arXiv Detail & Related papers (2024-11-15T09:22:14Z) - Combining AI Control Systems and Human Decision Support via Robustness and Criticality [53.10194953873209]
We extend a methodology for adversarial explanations (AE) to state-of-the-art reinforcement learning frameworks.
We show that the learned AI control system demonstrates robustness against adversarial tampering.
In a training / learning framework, this technology can improve both the AI's decisions and explanations through human interaction.
arXiv Detail & Related papers (2024-07-03T15:38:57Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
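A minimal sketch of the geometric picture the abstract paints, assuming the common construction of a "refusal direction" as a difference of mean hidden states; DRO's exact anchoring and training procedure may differ.

```python
import torch

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    """Estimate a refusal direction as the normalized difference between
    mean hidden states of harmful and harmless queries."""
    d = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return d / d.norm()

def shift_query(h_query: torch.Tensor, direction: torch.Tensor,
                harmful: bool, alpha: float = 1.0) -> torch.Tensor:
    """Move a query's representation along the refusal direction if it is
    harmful, and against it otherwise, mirroring the abstract's description."""
    sign = 1.0 if harmful else -1.0
    return h_query + sign * alpha * direction
```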
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Towards Formal Fault Injection for Safety Assessment of Automated Systems [0.0]
This paper introduces formal fault injection, a fusion of these two techniques throughout the development lifecycle.
We advocate for a more cohesive approach by identifying five areas of mutual support between formal methods and fault injection.
arXiv Detail & Related papers (2023-11-16T11:34:18Z) - Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts that lead diffusion models to generate inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z) - Against Algorithmic Exploitation of Human Vulnerabilities [2.6918074738262194]
We are concerned with the problem of machine learning models inadvertently modelling vulnerabilities.
We describe common vulnerabilities, and illustrate cases where they are likely to play a role in algorithmic decision-making.
We propose a set of requirements for methods to detect the potential for vulnerability modelling.
arXiv Detail & Related papers (2023-01-12T13:15:24Z) - Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models [103.71308117592963]
We present MLAC, an algorithm for training self-destructing models that leverages techniques from meta-learning and adversarial learning.
In a small-scale experiment, we show MLAC can largely prevent a BERT-style model from being re-purposed to perform gender identification.
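The joint objective can be gestured at with a single hedged training step: minimize loss on the intended task while ascending on the harmful one. The real MLAC additionally meta-learns against simulated adversarial fine-tuning runs; that outer loop is omitted here, so this is a simplification, not the paper's algorithm.

```python
import torch
import torch.nn.functional as F

def self_destruct_step(model, good_batch, harm_batch, opt, lam=1.0):
    """One step toward a 'self-destructing' objective: do well on the
    intended task while becoming a poor starting point for fine-tuning
    on the harmful task."""
    gx, gy = good_batch
    hx, hy = harm_batch
    loss = F.cross_entropy(model(gx), gy) - lam * F.cross_entropy(model(hx), hy)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```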
arXiv Detail & Related papers (2022-11-27T21:43:45Z) - Explainable Abuse Detection as Intent Classification and Slot Filling [66.80201541759409]
We introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone.
We show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.
arXiv Detail & Related papers (2022-10-06T03:33:30Z) - Two-stage Voice Application Recommender System for Unhandled Utterances in Intelligent Personal Assistant [5.475452673163167]
We propose a two-stage shortlister-reranker recommender system to match third-party voice applications to unhandled utterances.
We show how to build a new system by using observed data collected from a baseline rule-based system.
We present online A/B testing results that show a significant boost on user experience satisfaction.
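The shortlister-reranker pattern itself is standard; a minimal sketch follows, with toy lexical scorers standing in for the learned models the paper presumably uses.

```python
def cheap_score(utterance: str, app_desc: str) -> float:
    # Toy lexical overlap; a production shortlister would use a learned
    # retrieval model over embeddings.
    u, a = set(utterance.lower().split()), set(app_desc.lower().split())
    return len(u & a) / (len(u | a) or 1)

def rerank_score(utterance: str, app_desc: str) -> float:
    # Stand-in for a heavier model scoring the full (utterance, app) pair.
    return cheap_score(utterance, app_desc)

def recommend(utterance: str, apps: list[str],
              shortlist_k: int = 20, top_n: int = 3) -> list[str]:
    """Two-stage recommendation: a cheap first stage prunes the candidate
    pool, then a costlier second stage reorders the survivors."""
    shortlist = sorted(apps, key=lambda a: cheap_score(utterance, a),
                       reverse=True)[:shortlist_k]
    return sorted(shortlist, key=lambda a: rerank_score(utterance, a),
                  reverse=True)[:top_n]
```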
arXiv Detail & Related papers (2021-10-19T11:52:56Z) - A Conceptual Framework for Establishing Trust in Real World Intelligent Systems [0.0]
Trust in algorithms can be established by letting users interact with the system.
Reflecting features and patterns of human understanding of a domain against algorithmic results can create awareness of such patterns.
Close inspection can be used to decide whether a solution conforms to the expectations or whether it goes beyond the expected.
arXiv Detail & Related papers (2021-04-12T12:58:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.