Foveate, Attribute, and Rationalize: Towards Physically Safe and
Trustworthy AI
- URL: http://arxiv.org/abs/2212.09667v2
- Date: Fri, 19 May 2023 05:19:18 GMT
- Title: Foveate, Attribute, and Rationalize: Towards Physically Safe and
Trustworthy AI
- Authors: Alex Mei, Sharon Levy, William Yang Wang
- Abstract summary: Covertly unsafe text is an area of particular interest, as such text may arise from everyday scenarios and is challenging to detect as harmful.
We propose FARM, a novel framework leveraging external knowledge for trustworthy rationale generation in the context of safety.
Our experiments show that FARM obtains state-of-the-art results on the SafeText dataset, improving safety classification accuracy by an absolute 5.9%.
- Score: 76.28956947107372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Users' physical safety is an increasing concern as the market for intelligent
systems continues to grow, where unconstrained systems may recommend to users
dangerous actions that can lead to serious injury. Covertly unsafe text is an
area of particular interest, as such text may arise from everyday scenarios and
is challenging to detect as harmful. We propose FARM, a novel framework
leveraging external knowledge for trustworthy rationale generation in the
context of safety. In particular, FARM foveates on missing knowledge to qualify
the information required to reason in specific scenarios and retrieves this
information with attribution to trustworthy sources. This knowledge is used to
both classify the safety of the original text and generate human-interpretable
rationales, shedding light on the risk of systems to specific user groups and
helping both stakeholders manage the risks of their systems and policymakers
provide concrete safeguards for consumer safety. Our experiments show that FARM
obtains state-of-the-art results on the SafeText dataset, improving safety
classification accuracy by an absolute 5.9%.
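The abstract frames FARM as a three-stage pipeline: foveate on the missing knowledge a scenario requires, attribute retrieved knowledge to trustworthy sources, and use that knowledge to classify safety and generate a rationale. The sketch below only illustrates that control flow; the function names, query format, retrieval stub, and placeholder decision rule are assumptions for exposition, not the authors' implementation.
```python
# Minimal sketch of FARM's foveate -> attribute -> rationalize control flow.
# All names, prompts, and the decision rule are hypothetical; the actual system
# relies on LLM prompting and external retrieval as described in the paper.
from dataclasses import dataclass


@dataclass
class Evidence:
    statement: str  # knowledge filling an identified gap
    source: str     # attribution to a trustworthy external source


def foveate(scenario: str, advice: str) -> list[str]:
    """Pinpoint the missing knowledge needed to judge the advice (illustrative)."""
    return [f"What are the risks of '{advice}' when '{scenario}'?"]


def attribute(queries: list[str]) -> list[Evidence]:
    """Retrieve each missing fact with source attribution (stubbed here)."""
    # A real system would query an external knowledge source or search API.
    return [Evidence(statement=f"[retrieved answer to: {q}]",
                     source="[trusted source]") for q in queries]


def rationalize(scenario: str, advice: str,
                evidence: list[Evidence]) -> tuple[str, str]:
    """Classify the advice and produce a human-interpretable rationale."""
    rationale = " ".join(f"{e.statement} ({e.source})" for e in evidence)
    # Placeholder rule; FARM conditions a model on the retrieved knowledge instead.
    label = "unsafe" if "risk" in rationale.lower() else "safe"
    return label, rationale


if __name__ == "__main__":
    scenario = "your phone battery is swelling"
    advice = "puncture the battery to relieve the pressure"
    label, rationale = rationalize(scenario, advice,
                                   attribute(foveate(scenario, advice)))
    print(label, rationale)
```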
Related papers
- Defining and Evaluating Physical Safety for Large Language Models [62.4971588282174]
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones.
Their risks of causing physical threats and harm in real-world applications remain unexplored.
We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations.
arXiv Detail & Related papers (2024-11-04T17:41:25Z)
- Breach By A Thousand Leaks: Unsafe Information Leakage in `Safe' AI Responses [42.136793654338106]
We introduce a new safety evaluation framework based on impermissible information leakage of model outputs.
We show that to ensure safety against inferential adversaries, defense mechanisms must ensure information censorship.
arXiv Detail & Related papers (2024-07-02T16:19:25Z)
- Cross-Modality Safety Alignment [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment.
To empirically investigate this problem, we developed SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.
Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
- Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
SafePatching is a novel framework for comprehensive and efficient post safety alignment (PSA).
SafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z)
- Mapping LLM Security Landscapes: A Comprehensive Stakeholder Risk Assessment Proposal [0.0]
We propose a risk assessment process using tools such as the risk rating methodology used for traditional systems.
We conduct scenario analysis to identify potential threat agents and map the dependent system components against vulnerability factors.
We also map threats against three key stakeholder groups.
arXiv Detail & Related papers (2024-03-20T05:17:22Z)
- Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [65.77763092833348]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines.
While their capabilities are promising, these agents also introduce novel vulnerabilities that demand careful consideration for safety.
This paper conducts a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures.
arXiv Detail & Related papers (2024-02-06T18:54:07Z)
- Mitigating Covertly Unsafe Text within Natural Language Systems [55.26364166702625]
Uncontrolled systems may generate recommendations that lead to injury or life-threatening consequences.
In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text.
arXiv Detail & Related papers (2022-10-17T17:59:49Z)
- Towards a Responsible AI Development Lifecycle: Lessons From Information Security [0.0]
We propose a framework for responsibly developing artificial intelligence systems.
In particular, we propose leveraging the concepts of threat modeling, design review, penetration testing, and incident response.
arXiv Detail & Related papers (2022-03-06T13:03:58Z)
- SoK: A Framework for Unifying At-Risk User Research [18.216554583064063]
At-risk users are people who experience elevated digital security, privacy, and safety threats because of what they do.
We present a framework for reasoning about at-risk users based on a wide-ranging meta-analysis of 85 papers.
arXiv Detail & Related papers (2021-12-13T22:27:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.