Bandits for Online Calibration: An Application to Content Moderation on
Social Media Platforms
- URL: http://arxiv.org/abs/2211.06516v1
- Date: Fri, 11 Nov 2022 23:55:53 GMT
- Title: Bandits for Online Calibration: An Application to Content Moderation on
Social Media Platforms
- Authors: Vashist Avadhanula, Omar Abdul Baki, Hamsa Bastani, Osbert Bastani,
Caner Gocmen, Daniel Haimovich, Darren Hwang, Dima Karamshuk, Thomas Leeper,
Jiayuan Ma, Gregory Macnamara, Jake Mullett, Christopher Palow, Sung Park,
Varun S Rajagopal, Kevin Schaeffer, Parikshit Shah, Deeksha Sinha, Nicolas
Stier-Moses, Peng Xu
- Abstract summary: We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms.
We use both handcrafted and learned risk models to flag potentially violating content for human review.
Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models.
- Score: 14.242221219862849
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We describe the current content moderation strategy employed by Meta to
remove policy-violating content from its platforms. Meta relies on both
handcrafted and learned risk models to flag potentially violating content for
human review. Our approach aggregates these risk models into a single ranking
score, calibrating them to prioritize more reliable risk models. A key
challenge is that violation trends change over time, affecting which risk
models are most reliable. Our system additionally handles production challenges
such as changing risk models and novel risk models. We use a contextual bandit
to update the calibration in response to such trends. Our approach increases
Meta's top-line metric for measuring the effectiveness of its content
moderation strategy by 13%.
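
The abstract leaves the mechanism at a high level; below is a minimal sketch of how a contextual bandit could recalibrate the weights that combine risk-model scores. This is an illustration only, not Meta's production system: the LinUCB-style algorithm, the candidate weight vectors used as arms, the use of raw risk scores as context, and the reward signal (human review confirms a violation) are all assumptions.

```python
import numpy as np

class LinUCBCalibrator:
    """Sketch of a LinUCB-style contextual bandit over candidate calibrations.

    Each arm is a candidate weight vector for combining per-model risk scores
    into a single ranking score.  The context is the vector of raw risk
    scores, and the reward is 1 if human review confirms a violation, else 0,
    so the bandit drifts toward whichever weighting currently surfaces the
    most confirmed violations.
    """

    def __init__(self, candidate_weights, alpha=1.0):
        self.weights = [np.asarray(w, dtype=float) for w in candidate_weights]
        d = len(self.weights[0])
        self.A = [np.eye(d) for _ in self.weights]    # per-arm design matrix
        self.b = [np.zeros(d) for _ in self.weights]  # per-arm reward vector
        self.alpha = alpha                            # exploration strength

    def select_arm(self, risk_scores):
        """Pick the calibration with the highest upper confidence bound."""
        x = np.asarray(risk_scores, dtype=float)
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                         # ridge estimate of reward
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def ranking_score(self, arm, risk_scores):
        """Aggregate the per-model risk scores with the chosen calibration."""
        return float(self.weights[arm] @ np.asarray(risk_scores, dtype=float))

    def update(self, arm, risk_scores, reward):
        """Fold in human-review feedback for the arm that was played."""
        x = np.asarray(risk_scores, dtype=float)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Hypothetical usage with three risk models and two candidate calibrations.
calib = LinUCBCalibrator(candidate_weights=[[0.6, 0.3, 0.1], [0.2, 0.4, 0.4]])
scores = [0.9, 0.2, 0.5]                  # outputs of the three risk models
arm = calib.select_arm(scores)
priority = calib.ranking_score(arm, scores)
calib.update(arm, scores, reward=1)       # review confirmed a violation
```

Treating each candidate calibration as an arm keeps the action space small, while the context lets the bandit adapt its choice to the pattern of risk scores as violation trends drift.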
Related papers
- Optimal Classification under Performative Distribution Shift [13.508249764979075]
We propose a novel view in which performative effects are modelled as push-forward measures.
We prove the convexity of the performative risk under a new set of assumptions.
We also establish a connection with adversarially robust classification by reformulating the minimization of the performative risk as a min-max variational problem.
arXiv Detail & Related papers (2024-11-04T12:20:13Z)
- Let Community Rules Be Reflected in Online Content Moderation [2.4717834653693083]
This study proposes a community rule-based content moderation framework.
It integrates community rules into the moderation of user-generated content.
In particular, incorporating community rules substantially enhances model performance in content moderation.
arXiv Detail & Related papers (2024-08-21T23:38:02Z)
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
The models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by retrieving from external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL [57.202733701029594]
Decision Mamba is a novel multi-grained state space model with a self-evolving policy learning strategy.
To mitigate the overfitting issue on noisy trajectories, a self-evolving policy is proposed by using progressive regularization.
The policy evolves by using its own past knowledge to refine the suboptimal actions, thus enhancing its robustness on noisy demonstrations.
arXiv Detail & Related papers (2024-06-08T10:12:00Z)
- Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack.
When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model.
Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
- IMMA: Immunizing text-to-image Models against Malicious Adaptation [11.912092139018885]
Open-sourced text-to-image models and fine-tuning methods have led to the increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful/unauthorized content.
We propose to "immunize" the model by learning model parameters that are difficult for the adaptation methods to use when fine-tuning on malicious content; in short, IMMA.
Empirical results show IMMA's effectiveness against malicious adaptations, including mimicking the artistic style and learning of inappropriate/unauthorized content.
arXiv Detail & Related papers (2023-11-30T18:55:16Z)
- Improved Membership Inference Attacks Against Language Classification Models [0.0]
We present a novel framework for running membership inference attacks against classification models.
We show that this approach achieves higher accuracy than either a single attack model or an attack model per class label.
arXiv Detail & Related papers (2023-10-11T06:09:48Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild [7.176020195419459]
Social media platforms struggle to protect users from harmful content through content moderation.
These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily.
Third-party content moderation services provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons.
We introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way; a minimal illustrative sketch follows this entry.
arXiv Detail & Related papers (2022-08-16T03:51:43Z)
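
The entry above states the idea only at a high level. The sketch below is a hypothetical illustration of searching per-subtask thresholds on labeled validation data: the function name, the any-subtask-exceeds-its-threshold flagging rule, and the F1 objective are assumptions, not the paper's actual algorithm or cost model.

```python
import itertools
import numpy as np

def search_thresholds(scores, labels, grid=np.linspace(0.1, 0.9, 9)):
    """Exhaustive grid search over per-subtask thresholds (illustrative only).

    scores: (n_items, n_subtasks) array of subtask prediction scores.
    labels: (n_items,) array with 1 where the item truly violates policy.
    An item is flagged if any subtask score exceeds its threshold; the
    threshold vector maximizing F1 on this validation set is returned.
    Note the search grows exponentially in the number of subtasks.
    """
    n_subtasks = scores.shape[1]
    best_f1, best_thresholds = -1.0, None
    for thresholds in itertools.product(grid, repeat=n_subtasks):
        flagged = (scores > np.asarray(thresholds)).any(axis=1)
        tp = np.sum(flagged & (labels == 1))
        precision = tp / max(flagged.sum(), 1)
        recall = tp / max((labels == 1).sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        if f1 > best_f1:
            best_f1, best_thresholds = f1, thresholds
    return best_thresholds, best_f1
```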
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by penalizing the rewards with the uncertainty of the learned dynamics; a minimal illustrative sketch follows this entry.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
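
As a rough illustration of the penalty described above, the snippet below uses ensemble disagreement as a stand-in for the dynamics uncertainty; the paper's exact uncertainty estimator and penalty coefficient are not given here, so the function and its arguments are assumptions.

```python
import numpy as np

def penalized_reward(reward, ensemble_next_states, lam=1.0):
    """Uncertainty-penalized reward in the spirit of MOPO (illustrative only).

    ensemble_next_states: (n_models, state_dim) next-state predictions from an
    ensemble of learned dynamics models; their disagreement is used as a proxy
    for the uncertainty u(s, a).  Policy optimization inside the learned model
    then uses r(s, a) - lam * u(s, a) instead of the raw reward.
    """
    uncertainty = float(np.linalg.norm(np.std(ensemble_next_states, axis=0)))
    return reward - lam * uncertainty
```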