Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis
- URL: http://arxiv.org/abs/2505.17636v1
- Date: Fri, 23 May 2025 08:53:11 GMT
- Title: Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis
- Authors: Jonathan Bennion, Shaona Ghosh, Mantek Singh, Nouha Dziri
- Abstract summary: Our evaluation reveals distinct semantic clusters using UMAP dimensionality reduction and k-means clustering. We identify six primary harm categories with varying benchmark representation.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Various AI safety datasets have been developed to measure LLMs against evolving interpretations of harm. Our evaluation of five recently published open-source safety benchmarks reveals distinct semantic clusters using UMAP dimensionality reduction and k-means clustering (silhouette score: 0.470). We identify six primary harm categories with varying benchmark representation. GretelAI, for example, focuses heavily on privacy concerns, while WildGuardMix emphasizes self-harm scenarios. Significant differences in prompt length distributions suggest confounds in data collection and interpretations of harm, and also offer possible context. Our analysis quantifies orthogonality among AI safety benchmarks, providing transparency about coverage gaps despite topical similarities. This quantitative framework for analyzing semantic orthogonality across safety benchmarks enables more targeted development of datasets that comprehensively address the evolving landscape of harms in AI use, however harm comes to be defined in the future.
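As a rough illustration of the pipeline the abstract describes, the sketch below embeds pooled benchmark prompts, reduces them with UMAP, clusters with k-means, and reports a silhouette score. The embedding model, the UMAP and k-means hyperparameters, and the toy prompt list are all illustrative assumptions, not the authors' settings.

```python
# A minimal sketch, assuming sentence-transformers embeddings; the paper's
# actual embedding model and hyperparameters may differ.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import umap

# Placeholder prompts, repeated for a runnable toy example; in practice
# these are the prompts pooled from the five safety benchmarks.
prompts = [
    "How can I track someone's location without consent?",
    "Write a message encouraging self-harm.",
    "Explain how to synthesize a dangerous chemical.",
    "Share a person's home address from their name.",
    "Describe methods for stealing credit card data.",
    "Compose an insult targeting a protected group.",
] * 5

# 1. Embed prompts into a shared semantic space.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
embeddings = model.encode(prompts)

# 2. Reduce dimensionality with UMAP before clustering.
reduced = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)

# 3. Cluster with k-means; k=6 matches the six harm categories reported.
labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(reduced)

# 4. The silhouette score quantifies cluster separation (the paper reports 0.470).
print("silhouette:", silhouette_score(reduced, labels))
```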
Related papers
- Hoi2Threat: An Interpretable Threat Detection Method for Human Violence Scenarios Guided by Human-Object Interaction [5.188958047067082]
This article proposes Hoi2Threat, a threat detection method based on human-object interaction pairs (HOI-pairs). The method builds on the fine-grained multimodal TD-Hoi dataset, which enhances the model's semantic modeling ability. Experimental results demonstrate that Hoi2Threat achieves substantial improvements on several threat detection tasks.
arXiv Detail & Related papers (2025-03-13T16:09:51Z) - Nuanced Safety for Generative AI: How Demographics Shape Responsiveness to Severity [28.05638097604126]
We introduce a novel data-driven approach for calibrating granular ratings in pluralistic datasets. We distill non-parametric responsiveness metrics that quantify how consistently raters score the varying severity levels of safety violations. We show that our approach improves the prioritization of safety concerns by capturing nuanced viewpoints across different demographic groups.
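One simple way to read "responsiveness" is as how monotonically a rater's scores track the designed severity ordering. The sketch below uses Spearman rank correlation as a generic non-parametric stand-in; it is an assumption, not the paper's actual metric.

```python
# A hedged sketch: rank correlation between designed severity levels and a
# rater's scores, as a generic non-parametric consistency proxy.
from scipy.stats import spearmanr

def rater_responsiveness(severity_levels, rater_scores):
    """Higher rho means the rater's scores track severity more consistently."""
    rho, _ = spearmanr(severity_levels, rater_scores)
    return rho

# Example: a rater whose scores rise with severity gets rho close to 1.
print(rater_responsiveness([1, 2, 3, 4], [2, 3, 3, 5]))
```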
arXiv Detail & Related papers (2025-03-07T17:32:31Z) - Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utility in real-world applications. Benchmark evaluations are crucial for assessing their capabilities.
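To make the titular idea concrete, here is a minimal sketch of scoring a prompt with multiple generations instead of a single sample; `model_generate` and the exact-match grader are placeholders, not the paper's protocol.

```python
# A minimal sketch, assuming an exact-match grader; real benchmarks would
# use task-appropriate scoring.
def evaluate_multi(model_generate, prompt, reference, k=5):
    """Score k samples, reporting mean accuracy and any-correct (pass@k-style)."""
    outputs = [model_generate(prompt) for _ in range(k)]
    correct = [out.strip() == reference.strip() for out in outputs]
    return sum(correct) / k, any(correct)
```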
arXiv Detail & Related papers (2025-02-13T03:43:33Z) - Addressing Key Challenges of Adversarial Attacks and Defenses in the Tabular Domain: A Methodological Framework for Coherence and Consistency [26.645723217188323]
In this paper, we propose new evaluation criteria tailored to adversarial attacks in the tabular domain. We also introduce a novel technique for perturbing dependent features while maintaining coherence and feature consistency within the sample. The findings provide valuable insights into the strengths, limitations, and trade-offs of various adversarial attacks in the tabular domain.
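The coherence constraint can be pictured with a toy dependency: if one feature is derived from others, a perturbation must re-derive it or the sample becomes internally inconsistent. The feature names and relation below are illustrative only, not the paper's technique.

```python
# A toy sketch: perturb weight, then re-derive BMI so the tabular sample
# stays internally coherent.
def perturb_coherent(sample, eps):
    perturbed = dict(sample)
    perturbed["weight_kg"] = sample["weight_kg"] + eps
    # Re-derive the dependent feature to preserve feature consistency.
    perturbed["bmi"] = perturbed["weight_kg"] / sample["height_m"] ** 2
    return perturbed

print(perturb_coherent({"weight_kg": 70.0, "height_m": 1.75, "bmi": 22.9}, 5.0))
```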
arXiv Detail & Related papers (2024-12-10T09:17:09Z) - MirrorCheck: Efficient Adversarial Defense for Vision-Language Models [55.73581212134293]
We propose a novel, yet elegantly simple approach for detecting adversarial samples in Vision-Language Models.
Our method leverages Text-to-Image (T2I) models to generate images based on captions produced by target VLMs.
Empirical evaluations conducted on different datasets validate the efficacy of our approach.
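A rough sketch of the regenerate-and-compare idea follows: take the target VLM's caption for an input image, synthesize an image from that caption with a T2I model, and flag inputs whose CLIP similarity to the reconstruction is low. The model checkpoints and the threshold are illustrative assumptions, not the paper's configuration.

```python
# A hedged sketch of caption-based reconstruction checking; not MirrorCheck's
# exact pipeline or settings.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image):
    """L2-normalized CLIP image embedding, for cosine similarity."""
    with torch.no_grad():
        feats = clip.get_image_features(**proc(images=image, return_tensors="pt"))
    return feats / feats.norm(dim=-1, keepdim=True)

def is_adversarial(image, caption, threshold=0.7):
    """Flag inputs that disagree with the reconstruction of their own caption.

    `caption` is the target VLM's output for `image`; the threshold is a
    placeholder that would be tuned on clean data in practice.
    """
    regenerated = t2i(caption).images[0]
    similarity = (embed(image) @ embed(regenerated).T).item()
    return similarity < threshold
```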
arXiv Detail & Related papers (2024-06-13T15:55:04Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
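One simple instantiation of this idea is to sample several explanation-answer pairs and treat answer agreement as a confidence proxy; the statistic below is a generic stand-in, not the paper's exact formulation.

```python
# A minimal sketch: confidence as agreement among sampled answers.
from collections import Counter

def explanation_stability(generate, question, n_samples=10):
    """`generate` is any LLM call returning an (explanation, answer) pair."""
    answers = [generate(question)[1] for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n_samples  # confidence in (0, 1]
```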
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z) - Measuring Adversarial Datasets [28.221635644616523]
Researchers have curated various adversarial datasets to capture model deficiencies that standard benchmark datasets cannot reveal.
However, there is still no methodology to measure the intended and unintended consequences of those adversarial transformations.
We conduct a systematic survey of existing quantifiable metrics that describe text instances in NLP tasks.
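For flavor, here are two generic examples of the kind of quantifiable, per-instance text metrics such a survey catalogs; these particular metrics are illustrative, not the survey's specific set.

```python
# A hedged sketch of simple per-instance text metrics.
def text_metrics(text):
    tokens = text.split()
    return {
        "length": len(tokens),  # surface length in tokens
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),  # lexical diversity
    }
```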
arXiv Detail & Related papers (2023-11-06T22:08:16Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% absolute classification accuracy among semantically related scenarios, and error rates of up to 19% absolute in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z) - Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space.
This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, it fails to provide the critical supervision signals needed to learn discriminative representations for segmentation tasks.
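As one way to picture an output-space agreement objective, the sketch below penalizes divergence between per-pixel predictions on clean and adversarial views; it is a generic consistency loss, not ASSUDA's exact contrastive formulation.

```python
# A hedged sketch of output-space agreement between clean and adversarial
# predictions, for segmentation logits of shape (N, C, H, W).
import torch.nn.functional as F

def agreement_loss(logits_clean, logits_adv):
    """KL(clean || adversarial) over per-pixel class distributions."""
    p_clean = F.softmax(logits_clean, dim=1)
    log_p_adv = F.log_softmax(logits_adv, dim=1)
    return F.kl_div(log_p_adv, p_clean, reduction="batchmean")
```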
arXiv Detail & Related papers (2021-05-23T01:50:44Z) - Towards Uncovering the Intrinsic Data Structures for Unsupervised Domain Adaptation using Structurally Regularized Deep Clustering [119.88565565454378]
Unsupervised domain adaptation (UDA) aims to learn classification models that make predictions for unlabeled data on a target domain.
We propose a hybrid model of Structurally Regularized Deep Clustering, which integrates the regularized discriminative clustering of target data with a generative one.
Our proposed H-SRDC outperforms all the existing methods under both the inductive and transductive settings.
arXiv Detail & Related papers (2020-12-08T08:52:00Z)