Foveate, Attribute, and Rationalize: Towards Physically Safe and
Trustworthy AI
- URL: http://arxiv.org/abs/2212.09667v2
- Date: Fri, 19 May 2023 05:19:18 GMT
- Title: Foveate, Attribute, and Rationalize: Towards Physically Safe and
Trustworthy AI
- Authors: Alex Mei, Sharon Levy, William Yang Wang
- Abstract summary: Covertly unsafe text is an area of particular interest, as such text may arise from everyday scenarios and is challenging to detect as harmful.
We propose FARM, a novel framework leveraging external knowledge for trustworthy rationale generation in the context of safety.
Our experiments show that FARM obtains state-of-the-art results on the SafeText dataset, with an absolute improvement of 5.9% in safety classification accuracy.
- Score: 76.28956947107372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Users' physical safety is an increasing concern as the market for intelligent
systems continues to grow, where unconstrained systems may recommend users
dangerous actions that can lead to serious injury. Covertly unsafe text is an
area of particular interest, as such text may arise from everyday scenarios and
is challenging to detect as harmful. We propose FARM, a novel framework
leveraging external knowledge for trustworthy rationale generation in the
context of safety. In particular, FARM foveates on missing knowledge to qualify
the information required to reason in specific scenarios and retrieves this
information with attribution to trustworthy sources. This knowledge is used to
both classify the safety of the original text and generate human-interpretable
rationales, shedding light on the risk these systems pose to specific user
groups and helping both stakeholders manage the risks of their systems and
policymakers provide concrete safeguards for consumer safety. Our experiments
show that FARM obtains state-of-the-art results on the SafeText dataset, with
an absolute improvement of 5.9% in safety classification accuracy.
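The abstract outlines FARM as a three-stage pipeline: foveate on the knowledge missing from a scenario, retrieve that knowledge with attribution to trustworthy sources, and then classify safety while generating a human-interpretable rationale. The sketch below is a minimal, hypothetical illustration of how such a pipeline could compose; the function bodies, the example scenario, and all names are placeholders for illustration only, not FARM's released implementation.

```python
# Illustrative sketch of a foveate -> attribute -> rationalize pipeline.
# All function bodies are placeholders (a real system would call an LLM and a
# retrieval backend); this only shows how the three stages compose.
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    text: str    # retrieved knowledge snippet
    source: str  # attribution to a trustworthy source (e.g., a URL)


def foveate(scenario: str, advice: str) -> List[str]:
    """Produce queries for the missing knowledge needed to judge the advice."""
    # Placeholder: a real system would prompt a language model for these queries.
    return [f"What are the risks of '{advice}' when '{scenario}'?"]


def attribute(queries: List[str]) -> List[Evidence]:
    """Retrieve knowledge for each query while keeping source attribution."""
    # Placeholder: a real system would query a search API or knowledge base.
    return [Evidence(text=f"(retrieved passage for: {q})", source="https://example.org")
            for q in queries]


def rationalize(scenario: str, advice: str, evidence: List[Evidence]) -> dict:
    """Classify the safety of the advice and produce an attributed rationale."""
    # Placeholder heuristic standing in for a model judgment over the evidence.
    unsafe = any("risk" in e.text.lower() for e in evidence)
    rationale = " ".join(f"{e.text} [{e.source}]" for e in evidence)
    return {"label": "unsafe" if unsafe else "safe", "rationale": rationale}


if __name__ == "__main__":
    # Hypothetical covertly unsafe example in the style of SafeText.
    scenario = "you are cold at night"
    advice = "sleep with a charcoal grill burning indoors"
    result = rationalize(scenario, advice, attribute(foveate(scenario, advice)))
    print(result["label"])
    print(result["rationale"])
```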
Related papers
- Assessing confidence in frontier AI safety cases [37.839615078345886]
A safety case presents a structured argument in support of a top-level claim about a safety property of the system.
This raises the question of what level of confidence should be associated with a top-level claim.
We propose a method by which AI developers can prioritise argument defeaters, making their investigation more efficient.
arXiv Detail & Related papers (2025-02-09T06:35:11Z)
- Open Problems in Machine Unlearning for AI Safety [61.43515658834902]
Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks.
In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety.
arXiv Detail & Related papers (2025-01-09T03:59:10Z)
- Usage Governance Advisor: From Intent to AI Governance [4.49852442764084]
Evaluating the safety of AI systems is a pressing concern for organizations deploying them.
We present Usage Governance Advisor, which creates semi-structured governance information.
arXiv Detail & Related papers (2024-12-02T20:36:41Z)
- SafetyAnalyst: Interpretable, transparent, and steerable safety moderation for AI behavior [56.10557932893919]
We present SafetyAnalyst, a novel AI safety moderation framework.
Given an AI behavior, SafetyAnalyst uses chain-of-thought reasoning to analyze its potential consequences.
It aggregates all harmful and beneficial effects into a harmfulness score using fully interpretable weight parameters; a generic sketch of this kind of weighted aggregation appears after this list.
arXiv Detail & Related papers (2024-10-22T03:38:37Z)
- Elevating Software Trust: Unveiling and Quantifying the Risk Landscape [9.428116807615407]
We propose a risk assessment framework called SAFER (Software Analysis Framework for Evaluating Risk).
This framework is based on the necessity of a dynamic, data-driven, and adaptable process to quantify security risk in the software supply chain.
The results suggest that SAFER mitigates subjectivity and yields dynamic data-driven weights as well as security risk scores.
arXiv Detail & Related papers (2024-08-06T00:50:08Z)
- Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses [42.136793654338106]
We introduce a new safety evaluation framework based on impermissible information leakage of model outputs.
We show that to ensure safety against inferential adversaries, defense mechanisms must ensure information censorship.
arXiv Detail & Related papers (2024-07-02T16:19:25Z)
- Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Model [73.8765529028288]
We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment.
To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations.
Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
arXiv Detail & Related papers (2024-06-21T16:14:15Z)
- Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science [65.77763092833348]
Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines.
While their capabilities are promising, these agents also introduce novel vulnerabilities that demand careful consideration for safety.
This paper conducts a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures.
arXiv Detail & Related papers (2024-02-06T18:54:07Z)
- Mitigating Covertly Unsafe Text within Natural Language Systems [55.26364166702625]
Uncontrolled systems may generate recommendations that lead to injury or life-threatening consequences.
In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text.
arXiv Detail & Related papers (2022-10-17T17:59:49Z)
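The SafetyAnalyst entry above describes aggregating harmful and beneficial effects into a single harmfulness score with interpretable weights. The snippet below is a generic weighted-sum sketch of that idea under assumed, made-up effect categories, scores, and weights; it is not SafetyAnalyst's actual scoring function.

```python
# Generic sketch of aggregating effect scores into one harmfulness score with
# interpretable weights. Categories, scores, and weights are invented for the
# example and do not come from the SafetyAnalyst paper.
harmful_effects = {"physical_harm": 0.8, "privacy_violation": 0.3}  # severity in [0, 1]
beneficial_effects = {"educational_value": 0.5}                     # benefit in [0, 1]

weights = {  # interpretable, human-settable importance of each effect category
    "physical_harm": 1.0,
    "privacy_violation": 0.6,
    "educational_value": 0.4,
}

harm = sum(weights[k] * v for k, v in harmful_effects.items())
benefit = sum(weights[k] * v for k, v in beneficial_effects.items())
harmfulness_score = harm - benefit
print(round(harmfulness_score, 2))  # 0.78
```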