GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules
- URL: http://arxiv.org/abs/2603.01724v1
- Date: Mon, 02 Mar 2026 10:50:11 GMT
- Title: GMP: A Benchmark for Content Moderation under Co-occurring Violations and Dynamic Rules
- Authors: Houde Dong, Yifei She, Kai Ye, Liangcai Su, Chenxiong Qian, Jie Hao
- Abstract summary: Large language models (LLMs) are adept at following fixed guidelines, but their judgment capabilities degrade when policies are unstable or context-dependent. This raises a critical question for evaluation: Does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?
- Score: 10.423914922203265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online content moderation is essential for maintaining a healthy digital environment, and reliance on AI for this task continues to grow. Consider a user comment using national stereotypes to insult a politician. This example illustrates two critical challenges in real-world scenarios: (1) Co-occurring Violations, where a single post violates multiple policies (e.g., prejudice and personal attacks); (2) Dynamic rules of moderation, where determination of a violation depends on platform-specific guidelines that evolve across contexts. The intersection of co-occurring harms and dynamically changing rules highlights a core limitation of current AI systems: although large language models (LLMs) are adept at following fixed guidelines, their judgment capabilities degrade when policies are unstable or context-dependent. In practice, such shortcomings lead to inconsistent moderation: either erroneously restricting legitimate expression or allowing harmful content to remain online. This raises a critical question for evaluation: Does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?
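The two challenges the abstract names, co-occurring violations and dynamic rules, can be sketched as a multi-label check against a platform-specific, mutable rule set. The rule names and predicates below are illustrative stand-ins, not part of the GMP benchmark:

```python
# Illustrative sketch: moderation as a multi-label check against a
# platform-specific rule set. Because the rule set is plain data, it can
# differ between platforms and change over time (dynamic rules).

def moderate(post: str, rules: dict) -> list[str]:
    """Return every rule the post violates (co-occurring violations)."""
    return [name for name, predicate in rules.items() if predicate(post)]

# Hypothetical rule set for one platform at one point in time.
platform_rules = {
    "prejudice": lambda p: "stereotype" in p.lower(),
    "personal_attack": lambda p: "idiot" in p.lower(),
}

# A single post can trigger several policies at once.
violations = moderate("That stereotype-peddling idiot...", platform_rules)

# Dynamic rules: a different context drops a policy entirely.
relaxed_rules = dict(platform_rules)
del relaxed_rules["personal_attack"]
```

A static benchmark that scores only a single fixed label per post would not distinguish a model that finds one violation from one that finds both, which is the evaluation gap the paper targets.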
Related papers
- Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously [51.03213216886717]
We take the position that the dominant paradigm of General Alignment reaches a structural ceiling in settings with conflicting values. We introduce Edge Alignment as a distinct approach in which systems preserve multi-dimensional value structure.
arXiv Detail & Related papers (2026-02-23T16:51:43Z)
- Executable Governance for AI: Translating Policies into Rules Using LLMs [1.388831902854619]
Policy-to-Tests (P2T) is a framework that converts natural policy documents into normalized, machine-readable rules. To test the framework beyond a single policy, we apply it across general frameworks, sector guidance, and enterprise standards. These AI-generated rules closely match strong human baselines on span-level and rule-level metrics, with robust inter-annotator agreement on the gold set.
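A hedged sketch of what a normalized, machine-readable rule might look like after a P2T-style conversion; the schema and field names here are illustrative assumptions, not the framework's actual output format:

```python
# Hypothetical normalized rule record, as a P2T-style pipeline might emit
# after parsing a natural-language policy document. The schema is an
# illustrative assumption, not the paper's actual format.
from dataclasses import dataclass

@dataclass
class Rule:
    rule_id: str          # stable identifier for the extracted rule
    source_span: str      # policy text the rule was extracted from
    condition: str        # normalized trigger condition
    action: str           # what the system must do when the condition holds

policy_text = "Posts that attack an individual must be removed."
rule = Rule(
    rule_id="R-001",
    source_span=policy_text,
    condition="post targets an individual with an attack",
    action="remove",
)
```

Keeping the `source_span` alongside the normalized fields is what makes span-level comparison against human annotations possible.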
arXiv Detail & Related papers (2025-12-04T03:11:54Z)
- Toward Virtuous Reinforcement Learning [1.3428344011390776]
This paper critiques common patterns in machine ethics for Reinforcement Learning (RL). We instead treat ethics as policy-level dispositions, that is, relatively stable habits that hold up when incentives, partners, or contexts change.
arXiv Detail & Related papers (2025-12-03T20:30:37Z)
- Asking For It: Question-Answering for Predicting Rule Infractions in Online Content Moderation [1.803599876087764]
ModQ is a novel question-answering framework for rule-sensitive content moderation. We implement two model variants and train them on large-scale datasets from Reddit and Lemmy. Both models outperform state-of-the-art baselines in identifying moderation-relevant rule violations.
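The QA framing can be sketched as turning each community rule into a question about the post. The template and the toy predictor below are assumptions about how such a framework might pose the problem, not ModQ's actual prompt or model:

```python
# Illustrative QA framing of rule-sensitive moderation: each community
# rule becomes a yes/no question about the post, so new rules need no
# retraining of the label space. Template and predictor are hypothetical.

def rule_to_question(rule: str, post: str) -> str:
    return (
        f"Rule: {rule}\n"
        f"Post: {post}\n"
        "Does the post violate this rule? Answer yes or no."
    )

def predict(question: str) -> str:
    # Stand-in for a trained QA model; a real system would call it here.
    return "yes" if "spoiler" in question.lower() else "no"

q = rule_to_question("No untagged spoilers.", "Huge spoiler: the ship sinks.")
answer = predict(q)
```

The appeal of this framing is that a previously unseen rule is just a new question, which suits communities (like subreddits or Lemmy instances) whose rules differ and evolve.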
arXiv Detail & Related papers (2025-10-07T18:11:27Z)
- Cognition-of-Thought Elicits Social-Aligned Reasoning in Large Language Models [17.381122321801556]
Large language models (LLMs) excel at complex reasoning but can still exhibit harmful behaviors. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop.
arXiv Detail & Related papers (2025-09-27T18:16:57Z)
- Customize Multi-modal RAI Guardrails with Precedent-based predictions [55.63757336900865]
A multi-modal guardrail must effectively filter image content based on user-defined policies. Existing fine-tuning methods typically condition predictions on pre-defined policies. We propose to condition the model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input.
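Conditioning on precedents can be sketched as retrieving the most similar prior cases and handing their recorded reasoning to the judge. The similarity measure and data shapes below are illustrative assumptions (real systems would use learned embeddings, not word overlap):

```python
# Sketch of precedent retrieval: find prior cases most similar to the
# input and condition the judgment on their recorded reasoning.
# Similarity is faked with bag-of-words overlap for illustration.
from collections import Counter

def similarity(a: str, b: str) -> float:
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    shared = sum((ca & cb).values())   # multiset intersection
    total = sum((ca | cb).values())    # multiset union
    return shared / total if total else 0.0

# Hypothetical precedent store: content, reasoning, and the prior verdict.
precedents = [
    {"content": "graphic violence in thumbnail",
     "reasoning": "violates the violence policy", "label": "block"},
    {"content": "harmless meme about cats",
     "reasoning": "no policy triggered", "label": "allow"},
]

def retrieve(query: str, k: int = 1) -> list[dict]:
    return sorted(precedents,
                  key=lambda p: similarity(query, p["content"]),
                  reverse=True)[:k]

best = retrieve("violence shown in a video thumbnail")[0]
```

The design point is that the reasoning travels with the retrieved case, so a new user-defined policy only requires new precedents, not fine-tuning.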
arXiv Detail & Related papers (2025-07-28T03:45:34Z)
- Few-shot Policy (de)composition in Conversational Question Answering [54.259440408606515]
We propose a neuro-symbolic framework to detect policy compliance using large language models (LLMs) in a few-shot setting. We show that our approach soundly reasons about policy compliance conversations by extracting sub-questions to be answered, assigning truth values from contextual information, and explicitly producing a set of logic statements from the given policies. We apply this approach to the popular PCD and conversational machine reading benchmark, ShARC, and show competitive performance with no task-specific finetuning.
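The pipeline described, extract sub-questions, assign truth values, evaluate an explicit logic statement, can be sketched as follows. The policy, sub-questions, and hand-filled answers are illustrative stand-ins for what the LLM would produce:

```python
# Sketch of the neuro-symbolic pattern: a policy is decomposed into
# sub-questions, each answered from context (a hand-filled dict stands in
# for the LLM's extracted truth values), then combined with explicit logic.

# Hypothetical policy: "Applicants who are residents AND earn below the
# threshold are eligible."
policy_logic = ("is_resident", "and", "has_income_below_threshold")

# Truth values the LLM would assign from the conversation context.
answers = {"is_resident": True, "has_income_below_threshold": False}

def evaluate(logic: tuple, truth: dict) -> bool:
    left, op, right = logic
    if op == "and":
        return truth[left] and truth[right]
    if op == "or":
        return truth[left] or truth[right]
    raise ValueError(f"unsupported operator: {op}")

compliant = evaluate(policy_logic, answers)
```

Separating the neural step (answering sub-questions) from the symbolic step (boolean evaluation) is what makes the final compliance decision auditable.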
arXiv Detail & Related papers (2025-01-20T08:40:15Z)
- Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z)
- Demarked: A Strategy for Enhanced Abusive Speech Moderation through Counterspeech, Detoxification, and Message Management [71.99446449877038]
We propose a more comprehensive approach, called Demarcation, that scores abusive speech on four aspects: (i) a severity scale; (ii) presence of a target; (iii) a context scale; (iv) a legal scale.
Our work aims to inform future strategies for effectively addressing abusive speech online.
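The four-aspect scoring could be represented as a simple record. The scales and the additive aggregation below are assumptions for illustration, not the paper's calibrated values:

```python
# Hypothetical representation of Demarcation's four scoring aspects.
# The 0-3 integer scales and the simple sum are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AbuseScore:
    severity: int      # (i) severity scale
    has_target: bool   # (ii) whether a specific target is present
    context: int       # (iii) context scale (aggravating/mitigating)
    legal: int         # (iv) legal scale (potential legal implications)

    def total(self) -> int:
        return self.severity + int(self.has_target) + self.context + self.legal

score = AbuseScore(severity=2, has_target=True, context=1, legal=0)
```

A multi-aspect score like this supports graded responses (counterspeech, detoxification, removal) rather than a single binary delete/keep decision.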
arXiv Detail & Related papers (2024-06-27T21:45:33Z)
- CPL-NoViD: Context-Aware Prompt-based Learning for Norm Violation Detection in Online Communities [28.576099654579437]
We introduce Context-aware Prompt-based Learning for Norm Violation Detection (CPL-NoViD).
CPL-NoViD outperforms the baseline by incorporating context through natural language prompts.
It establishes a new state-of-the-art in norm violation detection, surpassing existing benchmarks.
arXiv Detail & Related papers (2023-05-16T23:27:59Z)
- Dichotomy of Control: Separating What You Can Control from What You Cannot [129.62135987416164]
We propose a future-conditioned supervised learning framework that separates mechanisms within a policy's control (actions) from those beyond a policy's control (environment stochasticity).
We show that DoC yields policies that are consistent with their conditioning inputs, ensuring that conditioning a learned policy on a desired high-return future outcome will correctly induce high-return behavior.
arXiv Detail & Related papers (2022-10-24T17:49:56Z)
- Non-stationary Online Learning with Memory and Non-stochastic Control [71.14503310914799]
We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions.
In this paper, we introduce dynamic policy regret as the performance measure to design algorithms robust to non-stationary environments.
We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret in terms of time horizon, non-stationarity measure, and memory length.
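Dynamic policy regret, hedged here as the standard formulation from this line of work rather than the paper's exact statement, compares the learner's cumulative loss against a time-varying comparator sequence instead of a single fixed policy:

```latex
% Dynamic policy regret for OCO with memory: each loss f_t depends on the
% last m decisions, and the comparator sequence (u_1, ..., u_T) may change
% over time, which is what makes the measure robust to non-stationarity.
\mathrm{D\text{-}Regret}_T
  = \sum_{t=1}^{T} f_t(x_{t-m+1}, \ldots, x_t)
  - \sum_{t=1}^{T} f_t(u_{t-m+1}, \ldots, u_t)
```

With a fixed comparator ($u_1 = \cdots = u_T$) this reduces to the usual static policy regret, so bounds on the dynamic version are strictly more demanding.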
arXiv Detail & Related papers (2021-02-07T09:45:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.