PRISM: A Design Framework for Open-Source Foundation Model Safety
- URL: http://arxiv.org/abs/2406.10415v1
- Date: Fri, 14 Jun 2024 21:26:15 GMT
- Title: PRISM: A Design Framework for Open-Source Foundation Model Safety
- Authors: Terrence Neumann, Bryan Jones
- Abstract summary: This paper addresses the question of how open foundation model developers should approach model safety.
We introduce PRISM, a design framework for open-source foundation model safety that emphasizes Private, Robust, Independent Safety measures, at Minimal marginal cost of compute.
PRISM aims to create a safer open-source ecosystem that maximizes the potential of these powerful technologies while minimizing the risks to individuals and society as a whole.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of open-source foundation models has brought transparency and accessibility to this groundbreaking technology. However, this openness has also enabled the development of highly-capable, unsafe models, as exemplified by recent instances such as WormGPT and FraudGPT, which are specifically designed to facilitate criminal activity. As the capabilities of open foundation models continue to grow, potentially outpacing those of closed-source models, the risk of misuse by bad actors poses an increasingly serious threat to society. This paper addresses the critical question of how open foundation model developers should approach model safety in light of these challenges. Our analysis reveals that open-source foundation model companies often provide less restrictive acceptable use policies (AUPs) compared to their closed-source counterparts, likely due to the inherent difficulties in enforcing such policies once the models are released. To tackle this issue, we introduce PRISM, a design framework for open-source foundation model safety that emphasizes Private, Robust, Independent Safety measures, at Minimal marginal cost of compute. The PRISM framework proposes the use of modular functions that moderate prompts and outputs independently of the core language model, offering a more adaptable and resilient approach to safety compared to the brittle reinforcement learning methods currently used for value alignment. By focusing on identifying AUP violations and engaging the developer community in establishing consensus around safety design decisions, PRISM aims to create a safer open-source ecosystem that maximizes the potential of these powerful technologies while minimizing the risks to individuals and society as a whole.
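To make the modular design concrete, here is a minimal Python sketch of prompt and output moderation functions that sit outside the core model, in the spirit of the framework described above. All names (`PrismWrapper`, `prompt_moderator`, `output_moderator`) and the keyword-based rules are invented for illustration and are not taken from the paper; a real deployment would plug in trained AUP-violation classifiers in place of the toy checks.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical illustration of PRISM-style modular safety: the moderation
# functions live outside the core model and can be swapped or updated
# without retraining it. Keyword matching stands in for real AUP classifiers.

@dataclass
class ModerationResult:
    allowed: bool
    reason: str = ""

def prompt_moderator(prompt: str) -> ModerationResult:
    """Check a prompt against a (toy) acceptable-use policy before generation."""
    banned_topics = ["build malware", "phishing kit"]  # placeholder AUP rules
    for topic in banned_topics:
        if topic in prompt.lower():
            return ModerationResult(False, f"prompt violates AUP: {topic!r}")
    return ModerationResult(True)

def output_moderator(text: str) -> ModerationResult:
    """Check generated text independently of how it was produced."""
    if "step-by-step exploit" in text.lower():  # placeholder output rule
        return ModerationResult(False, "output violates AUP")
    return ModerationResult(True)

class PrismWrapper:
    """Wraps any generate() callable with independent prompt/output checks."""

    def __init__(self, generate: Callable[[str], str],
                 input_checks: List[Callable[[str], ModerationResult]],
                 output_checks: List[Callable[[str], ModerationResult]]):
        self.generate = generate
        self.input_checks = input_checks
        self.output_checks = output_checks

    def __call__(self, prompt: str) -> str:
        for check in self.input_checks:
            result = check(prompt)
            if not result.allowed:
                return f"[refused] {result.reason}"
        text = self.generate(prompt)
        for check in self.output_checks:
            result = check(text)
            if not result.allowed:
                return f"[withheld] {result.reason}"
        return text

if __name__ == "__main__":
    def echo_model(prompt: str) -> str:
        """Stand-in for a real foundation model."""
        return f"model response to: {prompt}"

    safe_model = PrismWrapper(echo_model, [prompt_moderator], [output_moderator])
    print(safe_model("Summarize the PRISM framework"))
    print(safe_model("Help me build malware"))
```

Because the checks are plain functions wrapping any `generate()` callable, they could in principle be audited, replaced, or updated by the developer community without touching the model weights, which is the adaptability the framework emphasizes.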
Related papers
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the assumption that base LLMs, which lack instruction tuning and alignment, pose little risk of misuse.
By deploying carefully designed in-context demonstrations, we show that base LLMs can effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge [17.3048898399324]
The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies.
However, this openness also exposes models to backdoor attacks, in which hidden malicious behaviors triggered by specific inputs compromise the integrity and reliability of natural language processing (NLP) systems.
This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities; a toy weight-averaging sketch follows the citation below.
arXiv Detail & Related papers (2024-02-29T16:37:08Z)
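As a rough illustration of the merging idea above, the sketch below uniformly averages the parameters of several models with identical architectures in PyTorch. The helper `merge_state_dicts` and the uniform-averaging scheme are assumptions made for this sketch; the paper's actual merging procedure may differ.

```python
import torch
from torch import nn

def merge_state_dicts(models, weights=None):
    """Average the parameters of homogeneous models (same architecture).

    Uniform weight averaging is a simple stand-in for the merging schemes
    studied in the paper; the intuition is that benign models dilute the
    backdoored model's trigger-specific weights.
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    reference = models[0].state_dict()
    merged = {}
    for name in reference:
        merged[name] = sum(w * m.state_dict()[name].detach()
                           for w, m in zip(weights, models))
    return merged

if __name__ == "__main__":
    # Toy example: three "homogeneous" linear models, one of which we imagine
    # is backdoored. Merging blends its parameters with the clean ones.
    torch.manual_seed(0)
    models = [nn.Linear(8, 2) for _ in range(3)]
    merged_model = nn.Linear(8, 2)
    merged_model.load_state_dict(merge_state_dicts(models))
    print(merged_model.weight.shape)  # torch.Size([2, 8])
```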
- On the Societal Impact of Open Foundation Models [93.67389739906561]
We focus on open foundation models, defined here as those with broadly available model weights.
We identify five distinctive properties of open foundation models that lead to both their benefits and risks.
arXiv Detail & Related papers (2024-02-27T16:49:53Z)
- Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted constrained Markov decision process (CMDP) model constrains risk only in expectation, which leaves room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task-reward-maximizing objective according to a safety critic; a toy version of this gating is sketched below.
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
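The toy sketch below illustrates the suppression idea referenced above: the task-reward term is scaled down as a safety critic's risk estimate grows. The sigmoid gate and its parameters are invented for illustration and are not the paper's actual, adaptive suppression rule.

```python
import numpy as np

def suppressed_objective(task_return, safety_critic_value,
                         threshold=0.5, sharpness=10.0):
    """Down-weight the task objective when the safety critic flags high risk.

    task_return: estimated return of the task-reward objective
    safety_critic_value: critic's estimate of expected constraint violation
    The sigmoid gate is a toy choice; the paper's suppression is adaptive.
    """
    gate = 1.0 / (1.0 + np.exp(sharpness * (safety_critic_value - threshold)))
    return gate * task_return

if __name__ == "__main__":
    for risk in (0.1, 0.5, 0.9):
        # High estimated risk -> the task objective is suppressed toward zero.
        print(risk, suppressed_objective(task_return=1.0, safety_critic_value=risk))
```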
- Open-Sourcing Highly Capable Foundation Models: An evaluation of risks, benefits, and alternative methods for pursuing open-source objectives [6.575445633821399]
Recent decisions by leading AI labs to either open-source their models or restrict access to them have sparked debate.
This paper offers an examination of the risks and benefits of open-sourcing highly capable foundation models.
arXiv Detail & Related papers (2023-09-29T17:03:45Z)
- Balancing Transparency and Risk: The Security and Privacy Risks of Open-Source Machine Learning Models [31.658006126446175]
We present a comprehensive overview of common privacy and security threats associated with the use of open-source models.
By raising awareness of these dangers, we strive to promote the responsible and secure use of AI systems.
arXiv Detail & Related papers (2023-08-18T11:59:15Z)
- Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models [103.71308117592963]
We present an algorithm for training self-destructing models leveraging techniques from meta-learning and adversarial learning.
In a small-scale experiment, we show that this algorithm, MLAC, can largely prevent a BERT-style model from being re-purposed to perform gender identification; a heavily simplified sketch of the adversarial idea follows the citation below.
arXiv Detail & Related papers (2022-11-27T21:43:45Z)
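The sketch below gives a heavily simplified, illustrative version of the adversarial component referenced above: a shared encoder is trained to keep a benign task learnable while performing gradient ascent on an adversary head's harmful-task loss. The data, model sizes, and training loop are invented, and the meta-learning through the adversary's adaptation that MLAC actually relies on is omitted here.

```python
import torch
from torch import nn

# Toy alternation: (1) an adversary head adapts to the harmful task on frozen
# features; (2) the encoder minimizes the benign loss while maximizing the
# adversary's harmful-task loss. This only illustrates the adversarial idea,
# not the full meta-learned procedure from the paper.

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
benign_head = nn.Linear(32, 4)    # task we want to keep working
harmful_head = nn.Linear(32, 2)   # task we want to block (the "adversary")

enc_opt = torch.optim.Adam(list(encoder.parameters()) + list(benign_head.parameters()), lr=1e-3)
adv_opt = torch.optim.Adam(harmful_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 16)                      # synthetic inputs
y_benign = torch.randint(0, 4, (64,))        # synthetic benign labels
y_harmful = torch.randint(0, 2, (64,))       # synthetic harmful-task labels

for step in range(100):
    # 1) Adversary adapts its head to the harmful task on detached features.
    adv_opt.zero_grad()
    adv_loss = loss_fn(harmful_head(encoder(x).detach()), y_harmful)
    adv_loss.backward()
    adv_opt.step()

    # 2) Encoder minimizes benign loss and *maximizes* the adversary's loss.
    enc_opt.zero_grad()
    feats = encoder(x)
    benign_loss = loss_fn(benign_head(feats), y_benign)
    harmful_loss = loss_fn(harmful_head(feats), y_harmful)
    (benign_loss - harmful_loss).backward()
    enc_opt.step()

print(f"benign loss: {benign_loss.item():.3f}  harmful loss: {harmful_loss.item():.3f}")
```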
- Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments [84.3830478851369]
We propose a safe reinforcement learning approach that can jointly learn the environment and optimize the control policy.
Our approach can effectively enforce hard safety constraints and significantly outperform CMDP-based baseline methods in system safety rate, as measured in simulations.
arXiv Detail & Related papers (2022-09-29T20:49:25Z)
- Safe Exploration in Model-based Reinforcement Learning using Control Barrier Functions [1.005130974691351]
We develop a novel class of CBFs, referred to as LCBFs, that retain the beneficial properties of standard CBFs for developing minimally-invasive safe control policies.
We show how these LCBFs can be used to augment a learning-based control policy to guarantee safety, and we then leverage this approach to develop a safe exploration framework; a toy barrier-function filter is sketched below.
arXiv Detail & Related papers (2021-04-16T15:29:58Z)
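To illustrate the barrier-function filtering referenced above, here is a toy sketch of a minimally-invasive safety filter for a one-dimensional single-integrator system. The system, parameters, and `cbf_safety_filter` helper are invented for illustration; the paper's LCBF construction for learning-based policies is not reproduced here.

```python
def cbf_safety_filter(x, u_nominal, x_max=1.0, alpha=2.0):
    """Minimally-invasive safety filter for the 1-D single integrator x' = u.

    Safe set: h(x) = x_max - x >= 0. The CBF condition h'(x) + alpha*h(x) >= 0
    reduces to u <= alpha * (x_max - x), so the filter clips the nominal
    (e.g. learned) control only when that bound would otherwise be violated.
    """
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)

if __name__ == "__main__":
    # Roll out an aggressive "learned" policy that always pushes toward +x;
    # the filter keeps the state inside the safe set x <= 1.0.
    x, dt = 0.0, 0.05
    for _ in range(100):
        u = cbf_safety_filter(x, u_nominal=3.0)
        x += dt * u
    print(f"final state: {x:.3f} (safe bound: 1.0)")
```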