Bandits for Online Calibration: An Application to Content Moderation on
Social Media Platforms
- URL: http://arxiv.org/abs/2211.06516v1
- Date: Fri, 11 Nov 2022 23:55:53 GMT
- Title: Bandits for Online Calibration: An Application to Content Moderation on
Social Media Platforms
- Authors: Vashist Avadhanula, Omar Abdul Baki, Hamsa Bastani, Osbert Bastani,
Caner Gocmen, Daniel Haimovich, Darren Hwang, Dima Karamshuk, Thomas Leeper,
Jiayuan Ma, Gregory Macnamara, Jake Mullett, Christopher Palow, Sung Park,
Varun S Rajagopal, Kevin Schaeffer, Parikshit Shah, Deeksha Sinha, Nicolas
Stier-Moses, Peng Xu
- Abstract summary: We describe the current content moderation strategy employed by Meta to remove policy-violating content from its platforms.
We use both handcrafted and learned risk models to flag potentially violating content for human review.
Our approach aggregates these risk models into a single ranking score, calibrating them to prioritize more reliable risk models.
- Score: 14.242221219862849
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We describe the current content moderation strategy employed by Meta to
remove policy-violating content from its platforms. Meta relies on both
handcrafted and learned risk models to flag potentially violating content for
human review. Our approach aggregates these risk models into a single ranking
score, calibrating them to prioritize more reliable risk models. A key
challenge is that violation trends change over time, affecting which risk
models are most reliable. Our system additionally handles production challenges
such as changing risk models and novel risk models. We use a contextual bandit
to update the calibration in response to such trends. Our approach increases
Meta's top-line metric for measuring the effectiveness of its content
moderation strategy by 13%.
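
The abstract leaves the mechanism at a high level; below is a minimal sketch of how a contextual bandit could recalibrate the weights that combine risk-model scores. This is an illustration only, not Meta's production system: the LinUCB-style algorithm, the candidate weight vectors used as arms, the use of raw risk scores as context, and the reward signal (human review confirms a violation) are all assumptions.

```python
import numpy as np

class LinUCBCalibrator:
    """Sketch of a LinUCB-style contextual bandit over candidate calibrations.

    Each arm is a candidate weight vector for combining per-model risk scores
    into a single ranking score.  The context is the vector of raw risk
    scores, and the reward is 1 if human review confirms a violation, else 0,
    so the bandit drifts toward whichever weighting currently surfaces the
    most confirmed violations.
    """

    def __init__(self, candidate_weights, alpha=1.0):
        self.weights = [np.asarray(w, dtype=float) for w in candidate_weights]
        d = len(self.weights[0])
        self.A = [np.eye(d) for _ in self.weights]    # per-arm design matrix
        self.b = [np.zeros(d) for _ in self.weights]  # per-arm reward vector
        self.alpha = alpha                            # exploration strength

    def select_arm(self, risk_scores):
        """Pick the calibration with the highest upper confidence bound."""
        x = np.asarray(risk_scores, dtype=float)
        ucbs = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                         # ridge estimate of reward
            ucbs.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucbs))

    def ranking_score(self, arm, risk_scores):
        """Aggregate the per-model risk scores with the chosen calibration."""
        return float(self.weights[arm] @ np.asarray(risk_scores, dtype=float))

    def update(self, arm, risk_scores, reward):
        """Fold in human-review feedback for the arm that was played."""
        x = np.asarray(risk_scores, dtype=float)
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Hypothetical usage with three risk models and two candidate calibrations.
calib = LinUCBCalibrator(candidate_weights=[[0.6, 0.3, 0.1], [0.2, 0.4, 0.4]])
scores = [0.9, 0.2, 0.5]                  # outputs of the three risk models
arm = calib.select_arm(scores)
priority = calib.ranking_score(arm, scores)
calib.update(arm, scores, reward=1)       # review confirmed a violation
```

Treating each candidate calibration as an arm keeps the action space small, while the context lets the bandit adapt its choice to the pattern of risk scores as violation trends drift.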
Related papers
- Optimal Classification under Performative Distribution Shift [13.508249764979075]
We propose a novel view in which performative effects are modelled as push-forward measures.
We prove the convexity of the performative risk under a new set of assumptions.
We also establish a connection with adversarially robust classification by reformulating the minimization of the performative risk as a min-max variational problem.
arXiv Detail & Related papers (2024-11-04T12:20:13Z)
- Let Community Rules Be Reflected in Online Content Moderation [2.4717834653693083]
This study proposes a community rule-based content moderation framework.
It integrates community rules into the moderation of user-generated content.
In particular, incorporating community rules substantially enhances model performance in content moderation.
arXiv Detail & Related papers (2024-08-21T23:38:02Z)
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
The models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by retrieving from external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- Decision Mamba: A Multi-Grained State Space Model with Self-Evolution Regularization for Offline RL [57.202733701029594]
Decision Mamba is a novel multi-grained state space model with a self-evolving policy learning strategy.
To mitigate the overfitting issue on noisy trajectories, a self-evolving policy is proposed by using progressive regularization.
The policy evolves by using its own past knowledge to refine the suboptimal actions, thus enhancing its robustness on noisy demonstrations.
arXiv Detail & Related papers (2024-06-08T10:12:00Z)
- Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack.
When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model.
Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
- IMMA: Immunizing text-to-image Models against Malicious Adaptation [11.912092139018885]
Open-sourced text-to-image models and fine-tuning methods have led to the increasing risk of malicious adaptation, i.e., fine-tuning to generate harmful/unauthorized content.
We propose to "immunize" the model by learning model parameters that are difficult for the adaptation methods to use when fine-tuning on malicious content; in short, IMMA.
Empirical results show IMMA's effectiveness against malicious adaptations, including mimicking the artistic style and learning of inappropriate/unauthorized content.
arXiv Detail & Related papers (2023-11-30T18:55:16Z)
- Improved Membership Inference Attacks Against Language Classification Models [0.0]
We present a novel framework for running membership inference attacks against classification models.
We show that this approach achieves higher accuracy than either a single attack model or an attack model per class label.
arXiv Detail & Related papers (2023-10-11T06:09:48Z)
- Mutual Information Regularized Offline Reinforcement Learning [76.05299071490913]
We propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset.
We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset.
We introduce 3 different variants of MISA, and empirically demonstrate that a tighter mutual information lower bound gives better offline RL performance.
arXiv Detail & Related papers (2022-10-14T03:22:43Z)
- Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild [7.176020195419459]
Social media platforms struggle to protect users from harmful content through content moderation.
These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily.
Third-party content moderation services provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons.
We introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way; a minimal illustrative sketch follows this entry.
arXiv Detail & Related papers (2022-08-16T03:51:43Z)
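
The entry above states the idea only at a high level. The sketch below is a hypothetical illustration of searching per-subtask thresholds on labeled validation data: the function name, the any-subtask-exceeds-its-threshold flagging rule, and the F1 objective are assumptions, not the paper's actual algorithm or cost model.

```python
import itertools
import numpy as np

def search_thresholds(scores, labels, grid=np.linspace(0.1, 0.9, 9)):
    """Exhaustive grid search over per-subtask thresholds (illustrative only).

    scores: (n_items, n_subtasks) array of subtask prediction scores.
    labels: (n_items,) array with 1 where the item truly violates policy.
    An item is flagged if any subtask score exceeds its threshold; the
    threshold vector maximizing F1 on this validation set is returned.
    Note the search grows exponentially in the number of subtasks.
    """
    n_subtasks = scores.shape[1]
    best_f1, best_thresholds = -1.0, None
    for thresholds in itertools.product(grid, repeat=n_subtasks):
        flagged = (scores > np.asarray(thresholds)).any(axis=1)
        tp = np.sum(flagged & (labels == 1))
        precision = tp / max(flagged.sum(), 1)
        recall = tp / max((labels == 1).sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-9)
        if f1 > best_f1:
            best_f1, best_thresholds = f1, thresholds
    return best_thresholds, best_f1
```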
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by penalizing the rewards with the uncertainty of the learned dynamics; a minimal illustrative sketch follows this entry.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
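
As a rough illustration of the penalty described above, the snippet below uses ensemble disagreement as a stand-in for the dynamics uncertainty; the paper's exact uncertainty estimator and penalty coefficient are not given here, so the function and its arguments are assumptions.

```python
import numpy as np

def penalized_reward(reward, ensemble_next_states, lam=1.0):
    """Uncertainty-penalized reward in the spirit of MOPO (illustrative only).

    ensemble_next_states: (n_models, state_dim) next-state predictions from an
    ensemble of learned dynamics models; their disagreement is used as a proxy
    for the uncertainty u(s, a).  Policy optimization inside the learned model
    then uses r(s, a) - lam * u(s, a) instead of the raw reward.
    """
    uncertainty = float(np.linalg.norm(np.std(ensemble_next_states, axis=0)))
    return reward - lam * uncertainty
```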