Nuanced Safety for Generative AI: How Demographics Shape Responsiveness to Severity
- URL: http://arxiv.org/abs/2503.05609v1
- Date: Fri, 07 Mar 2025 17:32:31 GMT
- Title: Nuanced Safety for Generative AI: How Demographics Shape Responsiveness to Severity
- Authors: Pushkar Mishra, Charvi Rastogi, Stephen R. Pfohl, Alicia Parrish, Roma Patel, Mark Diaz, Ding Wang, Michela Paganini, Vinodkumar Prabhakaran, Lora Aroyo, Verena Rieser
- Abstract summary: We introduce a novel data-driven approach for calibrating granular ratings in pluralistic datasets. We distill non-parametric responsiveness metrics that quantify the consistency of raters in scoring the varying levels of the severity of safety violations. We show that our approach offers improved capabilities for prioritizing safety concerns by capturing nuanced viewpoints across different demographic groups.
- Score: 28.05638097604126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring the safety of Generative AI requires a nuanced understanding of pluralistic viewpoints. In this paper, we introduce a novel data-driven approach for calibrating granular ratings in pluralistic datasets. Specifically, we address the challenge of interpreting responses of a diverse population to safety, expressed via ordinal scales (e.g., a Likert scale). We distill non-parametric responsiveness metrics that quantify the consistency of raters in scoring the varying levels of the severity of safety violations. Using safety evaluation of AI-generated content as a case study, we investigate how raters from different demographic groups (age, gender, ethnicity) use an ordinal scale to express their perception of the severity of violations in a pluralistic safety dataset. We apply our metrics across violation types, demonstrating their utility in extracting nuanced insights that are crucial for developing reliable AI systems in multi-cultural contexts. We show that our approach offers improved capabilities for prioritizing safety concerns by capturing nuanced viewpoints across different demographic groups, hence improving the reliability of pluralistic data collection and in turn contributing to more robust AI evaluations.
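As a rough illustration of the kind of non-parametric responsiveness metric described above (and not the paper's exact formulation), the sketch below scores each demographic group by how concordantly its median ordinal ratings track a reference severity level per item. The function name `group_responsiveness`, the reference severity levels, and the Kendall-tau-style pairwise concordance are assumptions made for this sketch.

```python
# Illustrative sketch only: a simple non-parametric "responsiveness" score,
# not the exact metric from the paper. Assumes each item has a reference
# severity level and each rater assigns an ordinal (Likert) rating.
from collections import defaultdict
from itertools import combinations
from statistics import median

def group_responsiveness(ratings, severity, group_of):
    """
    ratings:  dict {(rater_id, item_id): ordinal rating, e.g. 1-5}
    severity: dict {item_id: reference severity level} (covers all rated items)
    group_of: dict {rater_id: demographic group label}

    Returns {group: score in [-1, 1]}, where 1 means the group's median
    ratings are perfectly concordant with increasing severity
    (a Kendall-tau-style pairwise concordance).
    """
    # Collect each group's ratings per item, then take the median per item.
    per_group_item = defaultdict(lambda: defaultdict(list))
    for (rater, item), r in ratings.items():
        per_group_item[group_of[rater]][item].append(r)

    scores = {}
    for group, item_ratings in per_group_item.items():
        medians = {item: median(rs) for item, rs in item_ratings.items()}
        concordant = discordant = 0
        for a, b in combinations(medians, 2):
            ds = severity[a] - severity[b]
            dr = medians[a] - medians[b]
            if ds == 0:          # skip pairs at the same severity level
                continue
            if ds * dr > 0:
                concordant += 1
            elif ds * dr < 0:
                discordant += 1
        total = concordant + discordant
        scores[group] = (concordant - discordant) / total if total else 0.0
    return scores
```

Comparing the resulting per-group scores indicates which demographic groups use the ordinal scale consistently with increasing violation severity, which is the flavor of insight the paper's responsiveness metrics are designed to surface.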
Related papers
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.
We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents.
Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses.
This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv Detail & Related papers (2025-04-10T16:00:59Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability.
The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - Insights on Disagreement Patterns in Multimodal Safety Perception across Diverse Rater Groups [29.720095331989064]
AI systems crucially rely on human ratings, but these ratings are often aggregated.
This is particularly concerning when evaluating the safety of generative AI, where perceptions and associated harms can vary significantly across socio-cultural contexts.
We conduct a large-scale study employing highly-parallel safety ratings of about 1000 text-to-image (T2I) generations from a demographically diverse rater pool of 630 raters.
arXiv Detail & Related papers (2024-10-22T13:59:21Z) - Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z) - Interpretable Rule-Based System for Radar-Based Gesture Sensing: Enhancing Transparency and Personalization in AI [2.99664686845172]
We introduce MIRA, a transparent and interpretable multi-class rule-based algorithm tailored for radar-based gesture detection.
We showcase the system's adaptability through personalized rule sets that calibrate to individual user behavior, offering a user-centric AI experience.
Our research underscores MIRA's ability to deliver both high interpretability and performance and emphasizes the potential for broader adoption of interpretable AI in safety-critical applications.
arXiv Detail & Related papers (2024-09-30T16:40:27Z) - Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? [59.96471873997733]
We propose an empirical foundation for developing more meaningful safety metrics and define AI safety in a machine learning research context. We aim to provide a more rigorous framework for AI safety research, advancing the science of safety evaluations and clarifying the path towards measurable progress.
arXiv Detail & Related papers (2024-07-31T17:59:24Z) - GRASP: A Disagreement Analysis Framework to Assess Group Associations in Perspectives [18.574420136899978]
We propose GRASP, a comprehensive disagreement analysis framework to measure group association in perspectives among different rater sub-groups.
Our framework reveals specific rater groups that have significantly different perspectives than others on certain tasks, and helps identify demographic axes that are crucial to consider in specific task contexts; a generic sketch of this style of group-level analysis appears after this list.
arXiv Detail & Related papers (2023-11-09T00:12:21Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z) - Building Safe and Reliable AI systems for Safety Critical Tasks with Vision-Language Processing [1.2183405753834557]
Current AI algorithms are unable to identify common causes of failure.
Additional techniques are required to quantify the quality of their predictions.
This thesis will focus on vision-language data processing for tasks like classification, image captioning, and vision question answering.
arXiv Detail & Related papers (2023-08-06T18:05:59Z) - Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, the inability to explain decisions, and bias in training data are some of the most prominent limitations.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)
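As noted in the GRASP entry above, disagreement analysis asks whether a rater sub-group's perspective diverges from the rest of the pool. The sketch below is a generic illustration of that style of analysis, not GRASP's actual statistics: it compares within-group agreement to cross-group agreement on ordinal ratings, and the names `group_divergence` and `pairwise_agreement`, as well as the normalized absolute-difference agreement measure, are assumptions made for this sketch.

```python
# Rough illustration of group-level disagreement analysis (in the spirit of
# GRASP, but not its actual statistics). Agreement between two raters on an
# item is taken here as 1 minus the normalized absolute difference of their
# ordinal ratings -- an assumption made for this sketch.
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(r1, r2, scale_max=5):
    """Agreement in [0, 1] between two ordinal ratings on a 1..scale_max scale."""
    return 1.0 - abs(r1 - r2) / (scale_max - 1)

def group_divergence(ratings, group_of, scale_max=5):
    """
    ratings:  dict {item_id: {rater_id: ordinal rating}}
    group_of: dict {rater_id: demographic group label}

    Returns {group: mean within-group agreement minus mean cross-group
    agreement}; large positive values flag groups whose internal consensus
    differs from the rest of the rater pool.
    """
    within = defaultdict(list)
    cross = defaultdict(list)
    for item, by_rater in ratings.items():
        for (a, ra), (b, rb) in combinations(by_rater.items(), 2):
            agr = pairwise_agreement(ra, rb, scale_max)
            ga, gb = group_of[a], group_of[b]
            if ga == gb:
                within[ga].append(agr)
            else:
                cross[ga].append(agr)
                cross[gb].append(agr)
    return {
        g: sum(within[g]) / len(within[g]) - sum(cross[g]) / len(cross[g])
        for g in within
        if within[g] and cross[g]
    }
```

A group with a large positive divergence agrees internally but not with the rest of the raters, which is the kind of signal a disagreement-analysis framework would then examine along demographic axes.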