Strategic Filtering for Content Moderation: Free Speech or Free of Distortion?
- URL: http://arxiv.org/abs/2507.20061v1
- Date: Sat, 26 Jul 2025 21:04:19 GMT
- Title: Strategic Filtering for Content Moderation: Free Speech or Free of Distortion?
- Authors: Saba Ahmadi, Avrim Blum, Haifeng Xu, Fan Yao
- Abstract summary: User-generated content (UGC) on social media platforms is vulnerable to incitements and manipulations. We aim at optimizing the trade-off between minimizing social distortion and maximizing free speech.
- Score: 41.59893570633978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: User-generated content (UGC) on social media platforms is vulnerable to incitements and manipulations, necessitating effective regulations. To address these challenges, those platforms often deploy automated content moderators tasked with evaluating the harmfulness of UGC and filtering out content that violates established guidelines. However, such moderation inevitably gives rise to strategic responses from users, who strive to express themselves within the confines of guidelines. Such phenomena call for a careful balance between: 1. ensuring freedom of speech -- by minimizing the restriction of expression; and 2. reducing social distortion -- measured by the total amount of content manipulation. We tackle the problem of optimizing this balance through the lens of mechanism design, aiming at optimizing the trade-off between minimizing social distortion and maximizing free speech. Although determining the optimal trade-off is NP-hard, we propose practical methods to approximate the optimal solution. Additionally, we provide generalization guarantees determining the amount of finite offline data required to approximate the optimal moderator effectively.
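To make the free-speech versus distortion trade-off concrete, below is a minimal, hypothetical sketch, not the paper's mechanism: a threshold moderator blocks content whose harmfulness exceeds a cutoff, strategic users just above the cutoff manipulate their content to slip under it, and sweeping the cutoff traces the tension between restricted speech and total manipulation. The harmfulness distribution, `manipulation_budget`, and the simplification that users manipulate exactly down to the cutoff are all assumptions for illustration.

```python
# Hypothetical illustration of the trade-off described in the abstract
# (not the authors' algorithm): a threshold moderator, strategic users,
# and a sweep over cutoffs.
import numpy as np

rng = np.random.default_rng(0)
harm = rng.uniform(0.0, 1.0, size=10_000)   # assumed harmfulness of each piece of UGC
manipulation_budget = 0.2                   # assumed max shift a strategic user will pay for


def outcomes(threshold: float):
    """Return (fraction of speech blocked, total distortion) under a cutoff `threshold`."""
    over = harm > threshold
    # Users close enough to the cutoff manipulate their content down to it (simplified).
    can_manipulate = over & (harm - threshold <= manipulation_budget)
    distortion = np.sum(harm[can_manipulate] - threshold)  # total amount of manipulation
    blocked = np.mean(over & ~can_manipulate)               # fraction of expression restricted
    return blocked, distortion


for t in np.linspace(0.5, 1.0, 6):
    blocked, distortion = outcomes(t)
    print(f"threshold={t:.2f}  blocked={blocked:.3f}  distortion={distortion:.1f}")
```

A higher cutoff restricts less speech but invites more manipulation of borderline content; the paper studies how to choose such a moderator near-optimally despite the NP-hardness of the exact trade-off.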
Related papers
- Exterior Penalty Policy Optimization with Penalty Metric Network under Constraints [52.37099916582462]
In Constrained Reinforcement Learning (CRL), agents explore the environment to learn the optimal policy while satisfying constraints.
We propose a theoretically guaranteed penalty function method, Exterior Penalty Policy Optimization (EPO), with adaptive penalties generated by a Penalty Metric Network (PMN).
PMN responds appropriately to varying degrees of constraint violations, enabling efficient constraint satisfaction and safe exploration.
arXiv Detail & Related papers (2024-07-22T10:57:32Z) - Demarked: A Strategy for Enhanced Abusive Speech Moderation through Counterspeech, Detoxification, and Message Management [71.99446449877038]
We propose a more comprehensive approach called Demarcation, which scores abusive speech based on four aspects: (i) severity scale; (ii) presence of a target; (iii) context scale; (iv) legal scale.
Our work aims to inform future strategies for effectively addressing abusive speech online.
arXiv Detail & Related papers (2024-06-27T21:45:33Z) - Self-Evolution Fine-Tuning for Policy Optimization [22.629113943131294]
We introduce self-evolution fine-tuning (SEFT) for policy optimization.
SEFT eliminates the need for annotated samples while retaining the stability and efficiency of supervised fine-tuning.
One of the prominent features of this method is its ability to leverage unlimited amounts of unannotated data for policy optimization.
arXiv Detail & Related papers (2024-06-16T06:38:02Z) - Content-Agnostic Moderation for Stance-Neutral Recommendation [13.210645250173997]
Content-agnostic moderation does not rely on the actual content being moderated, arguably making it less prone to forms of censorship.
We introduce two novel content-agnostic moderation methods that modify the recommendations from the content recommender to disperse user-item co-clusters without relying on content features.
Our results indicate that achieving stance neutrality without direct content information is not only feasible but can also help in developing more balanced and informative recommendation systems without substantially degrading user engagement.
arXiv Detail & Related papers (2024-05-29T09:50:39Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Resilient Constrained Reinforcement Learning [87.4374430686956]
We study a class of constrained reinforcement learning (RL) problems in which multiple constraint specifications are not identified before study.
It is challenging to identify appropriate constraint specifications due to the undefined trade-off between the reward training objective and the constraint satisfaction.
We propose a new constrained RL approach that searches for policy and constraint specifications together.
arXiv Detail & Related papers (2023-12-28T18:28:23Z) - Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild [7.176020195419459]
Social media platforms struggle to protect users from harmful content through content moderation.
These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily.
Third-party content moderation services provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons.
We introduce a simple yet effective threshold optimization method that searches for the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-08-16T03:51:43Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - Disinformation, Stochastic Harm, and Costly Filtering: A Principal-Agent Analysis of Regulating Social Media Platforms [2.9747815715612713]
The spread of disinformation on social media platforms such as Facebook is harmful to society.
However, filtering disinformation is costly, requiring platforms to implement filtering algorithms or employ manual filtering effort.
Since the costs of harmful content are borne by other entities, the platform has no incentive to filter at a socially-optimal level.
arXiv Detail & Related papers (2021-06-17T23:27:43Z)
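The threshold-optimization idea from "Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild" can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the subtask scores are simulated, and the cost weights `FP_COST` and `FN_COST` are assumptions. Content is flagged when any subtask score exceeds its threshold, and the thresholds are grid-searched to minimize a weighted misclassification cost on labeled data.

```python
# Hypothetical sketch of multi-subtask threshold optimization for content moderation
# (simulated data and assumed cost model, not the paper's code).
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
labels = rng.integers(0, 2, size=n)  # 1 = truly harmful (simulated ground truth)
# Simulated third-party subtask scores (e.g. underage personnel, rude gestures, weapons).
scores = np.clip(labels[:, None] * 0.4 + rng.normal(0.3, 0.2, size=(n, 3)), 0.0, 1.0)

FP_COST, FN_COST = 1.0, 5.0  # assumed relative costs of over- and under-moderation


def moderation_cost(thresholds):
    """Flag content if any subtask score exceeds its threshold; return the weighted cost."""
    flagged = (scores > np.asarray(thresholds)).any(axis=1)
    fp = np.sum(flagged & (labels == 0))   # benign content removed
    fn = np.sum(~flagged & (labels == 1))  # harmful content kept
    return FP_COST * fp + FN_COST * fn


grid = np.linspace(0.1, 0.9, 9)
best = min(itertools.product(grid, repeat=3), key=moderation_cost)
print("best thresholds:", best, "cost:", moderation_cost(best))
```

In practice the grid search would run on held-out labeled data, and the cost weights would encode the platform's moderation budget and tolerance for missed harmful content.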