Foveate, Attribute, and Rationalize: Towards Physically Safe and
Trustworthy AI
- URL: http://arxiv.org/abs/2212.09667v2
- Date: Fri, 19 May 2023 05:19:18 GMT
- Title: Foveate, Attribute, and Rationalize: Towards Physically Safe and
Trustworthy AI
- Authors: Alex Mei, Sharon Levy, William Yang Wang
- Abstract summary: Covertly unsafe text is an area of particular interest, as such text may arise from everyday scenarios and is challenging to detect as harmful.
We propose FARM, a novel framework leveraging external knowledge for trustworthy rationale generation in the context of safety.
Our experiments show that FARM obtains state-of-the-art results on the SafeText dataset, with an absolute improvement of 5.9% in safety classification accuracy.
- Score: 76.28956947107372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Users' physical safety is an increasing concern as the market for intelligent
systems continues to grow, where unconstrained systems may recommend users
dangerous actions that can lead to serious injury. Covertly unsafe text is an
area of particular interest, as such text may arise from everyday scenarios and
is challenging to detect as harmful. We propose FARM, a novel framework
leveraging external knowledge for trustworthy rationale generation in the
context of safety. In particular, FARM foveates on missing knowledge to qualify
the information required to reason in specific scenarios and retrieves this
information with attribution to trustworthy sources. This knowledge is used to
both classify the safety of the original text and generate human-interpretable
rationales, shedding light on the risk these systems pose to specific user
groups and helping both stakeholders manage the risks of their systems and
policymakers provide concrete safeguards for consumer safety. Our experiments
show that FARM obtains state-of-the-art results on the SafeText dataset, with
an absolute improvement of 5.9% in safety classification accuracy.
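The abstract outlines FARM as a three-stage pipeline: foveate on the knowledge missing from a scenario, retrieve that knowledge with attribution to trustworthy sources, and then classify safety while generating a human-interpretable rationale. The sketch below is a minimal, hypothetical illustration of how such a pipeline could compose; the function bodies, the example scenario, and all names are placeholders for illustration only, not FARM's released implementation.

```python
# Illustrative sketch of a foveate -> attribute -> rationalize pipeline.
# All function bodies are placeholders (a real system would call an LLM and a
# retrieval backend); this only shows how the three stages compose.
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    text: str    # retrieved knowledge snippet
    source: str  # attribution to a trustworthy source (e.g., a URL)


def foveate(scenario: str, advice: str) -> List[str]:
    """Produce queries for the missing knowledge needed to judge the advice."""
    # Placeholder: a real system would prompt a language model for these queries.
    return [f"What are the risks of '{advice}' when '{scenario}'?"]


def attribute(queries: List[str]) -> List[Evidence]:
    """Retrieve knowledge for each query while keeping source attribution."""
    # Placeholder: a real system would query a search API or knowledge base.
    return [Evidence(text=f"(retrieved passage for: {q})", source="https://example.org")
            for q in queries]


def rationalize(scenario: str, advice: str, evidence: List[Evidence]) -> dict:
    """Classify the safety of the advice and produce an attributed rationale."""
    # Placeholder heuristic standing in for a model judgment over the evidence.
    unsafe = any("risk" in e.text.lower() for e in evidence)
    rationale = " ".join(f"{e.text} [{e.source}]" for e in evidence)
    return {"label": "unsafe" if unsafe else "safe", "rationale": rationale}


if __name__ == "__main__":
    # Hypothetical covertly unsafe example in the style of SafeText.
    scenario = "you are cold at night"
    advice = "sleep with a charcoal grill burning indoors"
    result = rationalize(scenario, advice, attribute(foveate(scenario, advice)))
    print(result["label"])
    print(result["rationale"])
```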
Related papers
- Assessing confidence in frontier AI safety cases [37.839615078345886]
A safety case presents a structured argument in support of a top-level claim about a safety property of the system.
This raises the question of what level of confidence should be associated with a top-level claim.
We propose a method by which AI developers can prioritise argument defeaters, making their investigation more efficient.
arXiv Detail & Related papers (2025-02-09T06:35:11Z)
- Open Problems in Machine Unlearning for AI Safety [61.43515658834902]
Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks.
In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety.
arXiv Detail & Related papers (2025-01-09T03:59:10Z)
- Usage Governance Advisor: From Intent to AI Governance [4.49852442764084]
Evaluating the safety of AI systems is a pressing concern for organizations deploying them.
We present Usage Governance Advisor, which creates semi-structured governance information.
arXiv Detail & Related papers (2024-12-02T20:36:41Z)
- SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences.
It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters; a generic sketch of this kind of weighted aggregation appears after this list.
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
- Elevating Software Trust: Unveiling and Quantifying the Risk Landscape [9.428116807615407]
We propose a risk assessment framework called SAFER (Software Analysis Framework for Evaluating Risk).
This framework is based on the necessity of a dynamic, data-driven, and adaptable process to quantify security risk in the software supply chain.
The results suggest that SAFER mitigates subjectivity and yields dynamic data-driven weights as well as security risk scores.
arXiv Detail & Related papers (2024-08-06T00:50:08Z)
- Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses [42.136793654338106]
We introduce a new safety evaluation framework based on impermissible information leakage of model outputs.
We show that to ensure safety against inferential adversaries, defense mechanisms must ensure information censorship.
arXiv Detail & Related papers (2024-07-02T16:19:25Z)
- Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment.
To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.
Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
- Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [65.77763092833348]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines.
While their capabilities are promising, these agents also introduce novel vulnerabilities that demand careful consideration for safety.
This paper conducts a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures.
arXiv Detail & Related papers (2024-02-06T18:54:07Z)
- Mitigating Covertly Unsafe Text within Natural Language Systems [55.26364166702625]
Uncontrolled systems may generate recommendations that lead to injury or life-threatening consequences.
In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text.
arXiv Detail & Related papers (2022-10-17T17:59:49Z)
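The SafetyAnalyst entry above describes aggregating harmful and beneficial effects into a single harmfulness score with interpretable weights. The snippet below is a generic weighted-sum sketch of that idea under assumed, made-up effect categories, scores, and weights; it is not SafetyAnalyst's actual scoring function.

```python
# Generic sketch of aggregating effect scores into one harmfulness score with
# interpretable weights. Categories, scores, and weights are invented for the
# example and do not come from the SafetyAnalyst paper.
harmful_effects = {"physical_harm": 0.8, "privacy_violation": 0.3}  # severity in [0, 1]
beneficial_effects = {"educational_value": 0.5}                     # benefit in [0, 1]

weights = {  # interpretable, human-settable importance of each effect category
    "physical_harm": 1.0,
    "privacy_violation": 0.6,
    "educational_value": 0.4,
}

harm = sum(weights[k] * v for k, v in harmful_effects.items())
benefit = sum(weights[k] * v for k, v in beneficial_effects.items())
harmfulness_score = harm - benefit
print(round(harmfulness_score, 2))  # 0.78
```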