GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators
- URL: http://arxiv.org/abs/2601.03273v1
- Date: Mon, 22 Dec 2025 14:49:28 GMT
- Title: GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators
- Authors: Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki, Junaid Qadir, Mohamed M. Abdallah, Ala Al-Fuqaha
- Abstract summary: We present GuardEval, a benchmark dataset for the training and evaluation of large language model (LLM) moderators. We also present GemmaGuard (GGuard), a fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions.
- Score: 9.212268642636007
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As large language models (LLMs) become deeply embedded in daily life, the urgent need for safer moderation systems, distinguishing naive from harmful requests while upholding appropriate censorship boundaries, has never been greater. While existing LLMs can detect harmful or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, and broader safety concerns. We also present GemmaGuard (GGuard), a QLoRA fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels. Our evaluation shows that GGuard achieves a macro F1 score of 0.832, substantially outperforming leading moderation models, including OpenAI Moderator (0.64) and Llama Guard (0.61). We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions. GuardEval and GGuard together demonstrate that diverse, representative data materially improve safety, fairness, and robustness on complex, borderline cases.
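To ground the two headline components, here is a minimal, hedged sketch of how a QLoRA fine-tune of a Gemma3-12B moderator could be set up with Hugging Face transformers and peft. The checkpoint name, target modules, and all hyperparameters are illustrative assumptions, not the paper's reported configuration (the multimodal Gemma 3 checkpoints may also require a different loader class).

```python
# Hedged sketch: QLoRA adapter setup for a Gemma3-12B moderator.
# Checkpoint name and hyperparameters are assumptions, not the paper's.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "google/gemma-3-12b-it"  # assumed checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # QLoRA: 4-bit NF4 base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # assumed adapter settings
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only adapters are trainable
model.print_trainable_parameters()
```

The reported macro F1 averages per-category F1 equally across all 106 labels, so rare categories weigh as much as frequent ones. A toy computation (label names invented for illustration):

```python
from sklearn.metrics import f1_score

# Macro F1 = unweighted mean of per-category F1 scores.
y_true = ["hate_speech", "safe", "jailbreak", "safe"]
y_pred = ["hate_speech", "safe", "safe", "safe"]
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # 0.6
```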
Related papers
- Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context [82.32380418146656]
Health-ORSC-Bench is the first large-scale benchmark designed to measure Over-Refusal and Safe Completion quality in healthcare. Our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants.
arXiv Detail & Related papers (2026-01-25T01:28:52Z)
- ProGuard: Towards Proactive Multimodal Safeguard [48.89789547707647]
ProGuard is a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks. We first construct a modality-balanced dataset of 87K samples, each annotated with both binary safety labels and risk categories. We then train our vision-language base model purely through reinforcement learning to achieve efficient and concise reasoning.
arXiv Detail & Related papers (2025-12-29T16:13:23Z)
- AprielGuard [2.3704817495377526]
Existing tools treat safety risks as separate problems, limiting robustness and generalizability. We introduce AprielGuard, an 8B-parameter safeguard model that unifies these dimensions within a single taxonomy and learning framework. AprielGuard achieves strong performance in detecting harmful content and adversarial manipulations.
arXiv Detail & Related papers (2025-12-23T12:01:32Z)
- Qwen3Guard Technical Report [127.69960525219051]
We present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants. Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments. Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.
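For intuition on the streaming variant, below is a generic sketch of a token-level safety classification head: a small linear layer scoring each token's hidden state so moderation can act as text streams. This is an illustrative PyTorch toy with assumed dimensions, not Qwen3Guard's actual architecture.

```python
# Illustrative toy, not Qwen3Guard's actual head: score every token's
# hidden state so safety can be judged as tokens stream in.
import torch
import torch.nn as nn

class TokenSafetyHead(nn.Module):
    def __init__(self, hidden_size: int, num_classes: int = 3):
        super().__init__()
        # tri-class judgment per token, e.g. safe / controversial / unsafe
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the base LM
        return self.classifier(hidden_states)  # (batch, seq_len, num_classes)

head = TokenSafetyHead(hidden_size=1024)     # assumed hidden size
hs = torch.randn(1, 5, 1024)                 # stand-in per-token LM states
probs = head(hs).softmax(dim=-1)             # per-token class probabilities
print(probs[0, -1])                          # score for the newest token
```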
arXiv Detail & Related papers (2025-10-16T04:00:18Z)
- DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution. No existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z)
- IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement [35.904652937034136]
We introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning. We show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios.
arXiv Detail & Related papers (2025-08-27T16:47:31Z)
- XGUARD: A Graded Benchmark for Evaluating Safety Failures of Large Language Models on Extremist Content [3.4303348430261202]
We present XGUARD, a benchmark and evaluation framework designed to assess the severity of extremist content generated by Large Language Models (LLMs). XGUARD includes 3,840 red-teaming prompts sourced from real-world data such as social media and news, covering a broad range of ideologically charged scenarios. Our framework categorizes model responses into five danger levels (0 to 4), enabling a more nuanced analysis of both the frequency and severity of failures.
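As a rough illustration of how a graded 0-to-4 scheme supports frequency-and-severity analysis, here is a toy aggregation. The scores and the "any nonzero level counts as a failure" rule are assumptions for the sketch, not XGUARD's published protocol.

```python
# Toy aggregation over XGUARD-style danger levels (0 = safe ... 4 = most
# severe). The graded scores below are invented for illustration.
from statistics import mean

danger_levels = [0, 0, 3, 1, 4, 0, 2]           # one graded score per response

failures = [d for d in danger_levels if d > 0]  # assumed failure rule: level > 0
failure_rate = len(failures) / len(danger_levels)  # how often the model fails
mean_severity = mean(failures)                     # how badly, when it does

print(f"failure rate: {failure_rate:.2f}, mean severity: {mean_severity:.2f}")
```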
arXiv Detail & Related papers (2025-06-01T11:48:54Z)
- BingoGuard: LLM Content Moderation Tools with Risk Levels [67.53167973090356]
Malicious content generated by large language models (LLMs) can pose varying degrees of harm. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system.
arXiv Detail & Related papers (2025-03-09T10:43:09Z)
- WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs [54.10865585773691]
We introduce WildGuard -- an open, lightweight moderation tool for LLM safety. WildGuard achieves three goals: identifying malicious intent in user prompts, detecting safety risks of model responses, and determining model refusal rate.
arXiv Detail & Related papers (2024-06-26T16:58:20Z)