Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities
- URL: http://arxiv.org/abs/2509.02926v1
- Date: Wed, 03 Sep 2025 01:27:26 GMT
- Title: Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities
- Authors: Youngwoo Kim, Himanshu Beniwal, Steven L. Johnson, Thomas Hartvigsen
- Abstract summary: We represent moderation criteria as score tables of lexical expressions associated with content removal. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced.
- Score: 11.963784232069907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.
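The abstract's central representation, a score table mapping lexical expressions to weights associated with content removal, can be illustrated with a minimal sketch. Everything below (the community names, expressions, weights, and threshold) is invented for illustration and is not taken from the paper:

```python
# Hypothetical criteria: one score table per community, mapping a lexical
# expression to a removal-associated weight. All values are made up.
criteria = {
    "r/example_a": {"spam link": 2.5, "buy now": 1.8, "idiot": 0.4},
    "r/example_b": {"idiot": 2.0, "off-topic": 1.2},
}

def removal_score(comment: str, table: dict) -> float:
    """Sum the weights of every expression that appears in the comment."""
    text = comment.lower()
    return sum(weight for expr, weight in table.items() if expr in text)

def predict_removal(comment: str, table: dict, threshold: float = 1.5) -> bool:
    """Flag the comment for removal when its score reaches the threshold."""
    return removal_score(comment, table) >= threshold

# The same comment can be tolerated in one community and removed in another;
# comparing tables side by side is what surfaces enforcement variation.
comment = "don't be an idiot"
print(predict_removal(comment, criteria["r/example_a"]))  # False (0.4 < 1.5)
print(predict_removal(comment, criteria["r/example_b"]))  # True  (2.0 >= 1.5)
```

Stacking such tables across communities gives the "criteria matrix" the abstract describes: rows of expressions, columns of communities, each cell a community-specific weight.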
Related papers
- Asking For It: Question-Answering for Predicting Rule Infractions in Online Content Moderation [1.803599876087764]
ModQ is a novel question-answering framework for rule-sensitive content moderation. We implement two model variants and train them on large-scale datasets from Reddit and Lemmy. Both models outperform state-of-the-art baselines in identifying moderation-relevant rule violations.
arXiv Detail & Related papers (2025-10-07T18:11:27Z)
- SCORE: A Semantic Evaluation Framework for Generative Document Parsing [2.5101597298392098]
Multi-modal generative document parsing systems produce semantically correct yet structurally divergent outputs. Conventional metrics-CER, WER, IoU, or TEDS-misclassify such diversity as error, penalizing valid interpretations and obscuring system behavior. We introduce SCORE, an interpretation-agnostic framework that integrates (i) adjusted edit distance for robust content fidelity, (ii) token-level diagnostics to distinguish hallucinations from omissions, (iii) table evaluation with spatial tolerance and semantic alignment, and (iv) hierarchy-aware consistency checks.
arXiv Detail & Related papers (2025-09-16T16:06:19Z)
- Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z)
- A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement [7.031062446301277]
We introduce a novel content moderation framework that emphasizes the importance of capturing annotation disagreement. We leverage uncertainty estimation techniques, specifically Conformal Prediction, to account for both the ambiguity in comment annotations and the model's inherent uncertainty in predicting toxicity and disagreement.
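As a rough illustration of the conformal-prediction idea mentioned above (a generic split-conformal sketch, not the authors' actual pipeline), one calibrates a nonconformity threshold on held-out data and then emits a set of labels per comment; ambiguous comments receive both labels and can be routed to human moderators. The calibration scores below are made up:

```python
import math

# Hypothetical nonconformity scores from a calibration set:
# 1 - (probability the model assigned to the true label).
cal_scores = sorted([0.05, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.60, 0.70])

alpha = 0.2  # target coverage of 1 - alpha = 80%
n = len(cal_scores)
# Split-conformal quantile rank: ceil((n + 1) * (1 - alpha)), capped at n.
k = min(math.ceil((n + 1) * (1 - alpha)), n)
q = cal_scores[k - 1]  # calibrated threshold

def prediction_set(p_toxic: float) -> set:
    """Return every label whose nonconformity is within the threshold."""
    labels = set()
    if 1 - p_toxic <= q:
        labels.add("toxic")
    if p_toxic <= q:
        labels.add("non-toxic")
    return labels

# Confident comments get one label; ambiguous ones get both, which is the
# signal for escalating to a human (the "collaborative" part).
print(prediction_set(0.95))  # {'toxic'}
print(prediction_set(0.50))  # {'toxic', 'non-toxic'} -> route to a moderator
```

The guarantee is that, under exchangeability, the emitted set contains the true label with probability at least 1 - alpha, regardless of the underlying model.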
arXiv Detail & Related papers (2024-11-06T18:08:57Z)
- SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding [52.98133831401225]
Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence.
We propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo.
We introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries.
arXiv Detail & Related papers (2024-07-06T16:08:17Z)
- ToVo: Toxicity Taxonomy via Voting [25.22398575368979]
We propose a dataset creation mechanism that integrates voting and chain-of-thought processes. Our methodology ensures diverse classification metrics for each sample. We utilize the dataset created through our proposed mechanism to train our model.
arXiv Detail & Related papers (2024-06-21T02:35:30Z)
- Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection [13.772240348963303]
Rule By Example (RBE) is a novel exemplar-based contrastive learning approach for learning from logical rules for the task of textual content moderation.
RBE is capable of providing rule-grounded predictions, allowing for more explainable and customizable predictions compared to typical deep learning-based approaches.
arXiv Detail & Related papers (2023-07-24T16:55:37Z)
- Mitigating Catastrophic Forgetting in Task-Incremental Continual Learning with Adaptive Classification Criterion [50.03041373044267]
We propose a Supervised Contrastive learning framework with adaptive classification criterion for Continual Learning.
Experiments show that CFL achieves state-of-the-art performance and a stronger ability to overcome catastrophic forgetting than the classification baselines.
arXiv Detail & Related papers (2023-05-20T19:22:40Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- Challenges in Automated Debiasing for Toxic Language Detection [81.04406231100323]
Biased associations have been a challenge in the development of classifiers for detecting toxic language.
We investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection.
Our focus is on lexical markers (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English).
arXiv Detail & Related papers (2021-01-29T22:03:17Z)
- Compressive Summarization with Plausibility and Salience Modeling [54.37665950633147]
We propose to relax the rigid syntactic constraints on candidate spans and instead leave compression decisions to two data-driven criteria: plausibility and salience.
Our method achieves strong in-domain results on benchmark summarization datasets, and human evaluation shows that the plausibility model generally selects for grammatical and factual deletions.
arXiv Detail & Related papers (2020-10-15T17:07:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.