Nuanced Safety for Generative AI: How Demographics Shape Responsiveness to Severity
- URL: http://arxiv.org/abs/2503.05609v1
- Date: Fri, 07 Mar 2025 17:32:31 GMT
- Title: Nuanced Safety for Generative AI: How Demographics Shape Responsiveness to Severity
- Authors: Pushkar Mishra, Charvi Rastogi, Stephen R. Pfohl, Alicia Parrish, Roma Patel, Mark Diaz, Ding Wang, Michela Paganini, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser
- Abstract summary: We introduce a novel data-driven approach for calibrating granular ratings in pluralistic datasets. We distill non-parametric responsiveness metrics that quantify the consistency of raters in scoring the varying levels of the severity of safety violations. We show that our approach offers improved capabilities for prioritizing safety concerns by capturing nuanced viewpoints across different demographic groups.
- Score: 28.05638097604126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for calibrating granular ratings in pluralistic datasets. Specifically, we address the challenge of interpreting responses of a diverse population to safety, expressed via ordinal scales (e.g., a Likert scale). We distill non-parametric responsiveness metrics that quantify the consistency of raters in scoring the varying levels of the severity of safety violations. Using safety evaluation of AI-generated content as a case study, we investigate how raters from different demographic groups (age, gender, ethnicity) use an ordinal scale to express their perception of the severity of violations in a pluralistic safety dataset. We apply our metrics across violation types, demonstrating their utility in extracting nuanced insights that are crucial for developing reliable AI systems in multi-cultural contexts. We show that our approach offers improved capabilities for prioritizing safety concerns by capturing nuanced viewpoints across different demographic groups, hence improving the reliability of pluralistic data collection and in turn contributing to more robust AI evaluations.
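As a rough illustration of the kind of non-parametric responsiveness metric described above (and not the paper's exact formulation), the sketch below scores each demographic group by how concordantly its median ordinal ratings track a reference severity level per item. The function name `group_responsiveness`, the reference severity levels, and the Kendall-tau-style pairwise concordance are assumptions made for this sketch.

```python
# Illustrative sketch only: a simple non-parametric "responsiveness" score,
# not the exact metric from the paper. Assumes each item has a reference
# severity level and each rater assigns an ordinal (Likert) rating.
from collections import defaultdict
from itertools import combinations
from statistics import median

def group_responsiveness(ratings, severity, group_of):
    """
    ratings:  dict {(rater_id, item_id): ordinal rating, e.g. 1-5}
    severity: dict {item_id: reference severity level} (covers all rated items)
    group_of: dict {rater_id: demographic group label}

    Returns {group: score in [-1, 1]}, where 1 means the group's median
    ratings are perfectly concordant with increasing severity
    (a Kendall-tau-style pairwise concordance).
    """
    # Collect each group's ratings per item, then take the median per item.
    per_group_item = defaultdict(lambda: defaultdict(list))
    for (rater, item), r in ratings.items():
        per_group_item[group_of[rater]][item].append(r)

    scores = {}
    for group, item_ratings in per_group_item.items():
        medians = {item: median(rs) for item, rs in item_ratings.items()}
        concordant = discordant = 0
        for a, b in combinations(medians, 2):
            ds = severity[a] - severity[b]
            dr = medians[a] - medians[b]
            if ds == 0:          # skip pairs at the same severity level
                continue
            if ds * dr > 0:
                concordant += 1
            elif ds * dr < 0:
                discordant += 1
        total = concordant + discordant
        scores[group] = (concordant - discordant) / total if total else 0.0
    return scores
```

Comparing the resulting per-group scores indicates which demographic groups use the ordinal scale consistently with increasing violation severity, which is the flavor of insight the paper's responsiveness metrics are designed to surface.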
Related papers
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.
We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents.
Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses.
This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv Detail & Related papers (2025-04-10T16:00:59Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.
The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups [29.720095331989064]
AI systems crucially rely on human ratings, but these ratings are often aggregated.
This is particularly concerning when evaluating the safety of generative AI, where perceptions and associated harms can vary significantly across socio-cultural contexts.
We conduct a large-scale study employing highly-parallel safety ratings of about 1000 text-to-image (T2I) generations from a demographically diverse rater pool of 630 raters.
arXiv Detail & Related papers (2024-10-22T13:59:21Z) - Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z) - Interpretable Rule-Based System for Radar-Based Gesture Sensing: Enhancing Transparency and Personalization in AI [2.99664686845172]
We introduce MIRA, a transparent and interpretable multi-class rule-based algorithm tailored for radar-based gesture detection.
We showcase the system's adaptability through personalized rule sets that calibrate to individual user behavior, offering a user-centric AI experience.
Our research underscores MIRA's ability to deliver both high interpretability and performance and emphasizes the potential for broader adoption of interpretable AI in safety-critical applications.
arXiv Detail & Related papers (2024-09-30T16:40:27Z) - Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context. We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z) - GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives [18.574420136899978]
We propose GRASP, a comprehensive disagreement analysis framework to measure group association in perspectives among different rater sub-groups.
Our framework reveals specific rater groups that have significantly different perspectives than others on certain tasks, and helps identify demographic axes that are crucial to consider in specific task contexts; a generic sketch of this style of group-level analysis appears after this list.
arXiv Detail & Related papers (2023-11-09T00:12:21Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z) - Building Safe and Reliable AI systems for Safety Critical Tasks with Vision-Language Processing [1.2183405753834557]
Current AI algorithms are unable to identify common causes of failure.
Additional techniques are required to quantify the quality of their predictions.
This thesis will focus on vision-language data processing for tasks like classification, image captioning, and vision question answering.
arXiv Detail & Related papers (2023-08-06T18:05:59Z) - Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, the inability to explain decisions, and bias in training data are some of the most prominent limitations.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)
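As noted in the GRASP entry above, disagreement analysis asks whether a rater sub-group's perspective diverges from the rest of the pool. The sketch below is a generic illustration of that style of analysis, not GRASP's actual statistics: it compares within-group agreement to cross-group agreement on ordinal ratings, and the names `group_divergence` and `pairwise_agreement`, as well as the normalized absolute-difference agreement measure, are assumptions made for this sketch.

```python
# Rough illustration of group-level disagreement analysis (in the spirit of
# GRASP, but not its actual statistics). Agreement between two raters on an
# item is taken here as 1 minus the normalized absolute difference of their
# ordinal ratings -- an assumption made for this sketch.
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(r1, r2, scale_max=5):
    """Agreement in [0, 1] between two ordinal ratings on a 1..scale_max scale."""
    return 1.0 - abs(r1 - r2) / (scale_max - 1)

def group_divergence(ratings, group_of, scale_max=5):
    """
    ratings:  dict {item_id: {rater_id: ordinal rating}}
    group_of: dict {rater_id: demographic group label}

    Returns {group: mean within-group agreement minus mean cross-group
    agreement}; large positive values flag groups whose internal consensus
    differs from the rest of the rater pool.
    """
    within = defaultdict(list)
    cross = defaultdict(list)
    for item, by_rater in ratings.items():
        for (a, ra), (b, rb) in combinations(by_rater.items(), 2):
            agr = pairwise_agreement(ra, rb, scale_max)
            ga, gb = group_of[a], group_of[b]
            if ga == gb:
                within[ga].append(agr)
            else:
                cross[ga].append(agr)
                cross[gb].append(agr)
    return {
        g: sum(within[g]) / len(within[g]) - sum(cross[g]) / len(cross[g])
        for g in within
        if within[g] and cross[g]
    }
```

A group with a large positive divergence agrees internally but not with the rest of the raters, which is the kind of signal a disagreement-analysis framework would then examine along demographic axes.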