An Image is Worth a Thousand Toxic Words: A Metamorphic Testing
Framework for Content Moderation Software
- URL: http://arxiv.org/abs/2308.09810v1
- Date: Fri, 18 Aug 2023 20:33:06 GMT
- Title: An Image is Worth a Thousand Toxic Words: A Metamorphic Testing
Framework for Content Moderation Software
- Authors: Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu,
Pinjia He, Michael R. Lyu
- Abstract summary: Social media platforms are being increasingly misused to spread toxic content, including hate speech, malicious advertising, and pornography.
Despite tremendous efforts in developing and deploying content moderation methods, malicious users can evade moderation by embedding texts into images.
We propose a metamorphic testing framework for content moderation software.
- Score: 64.367830425115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponential growth of social media platforms has brought about a
revolution in communication and content dissemination in human society.
Nevertheless, these platforms are being increasingly misused to spread toxic
content, including hate speech, malicious advertising, and pornography, leading
to severe negative consequences such as harm to teenagers' mental health.
Despite tremendous efforts in developing and deploying textual and image
content moderation methods, malicious users can evade moderation by embedding
texts into images, such as screenshots of the text, usually with some
interference. We find that modern content moderation software's performance
against such malicious inputs remains underexplored. In this work, we propose
OASIS, a metamorphic testing framework for content moderation software. OASIS
employs 21 transformation rules summarized from our pilot study of 5,000 real-world
toxic content items collected from four popular social media applications:
Twitter, Instagram, Sina Weibo, and Baidu Tieba. Given toxic textual content,
OASIS generates image test cases that preserve the toxicity yet are likely
to bypass moderation. In the evaluation, we employ OASIS to test five
commercial textual content moderation services from major providers (i.e.,
Google Cloud, Microsoft Azure, Baidu Cloud, Alibaba Cloud, and Tencent Cloud),
as well as a state-of-the-art moderation research model. The results show that
OASIS achieves error-finding rates of up to 100%. Moreover, by retraining the
models on the test cases generated by OASIS, the robustness of a moderation
model can be improved without performance degradation.
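The core metamorphic relation here is that a toxicity-preserving transformation of an input should not flip the moderator's verdict; any input that is flagged before the transform but passes after it is an error found. A minimal, stdlib-only sketch of that oracle, where the keyword-based `moderate()` stub and the character-spacing `space_out()` transform are hypothetical stand-ins for real moderation software and for OASIS's 21 image-based rules:

```python
# Metamorphic-testing sketch: a toxicity-preserving transform should not
# change the moderation verdict. moderate() and space_out() are toy
# stand-ins, not the actual OASIS rules or any vendor API.

def moderate(text: str) -> bool:
    """Toy moderator: flags text containing a blocked keyword."""
    blocked = {"toxic", "hate"}
    return any(word in text.lower() for word in blocked)

def space_out(text: str) -> str:
    """Toy evasion transform: insert spaces between characters
    (stands in for rendering the text into a perturbed image)."""
    return " ".join(text)

def find_metamorphic_failures(seeds):
    """Return seeds that are flagged as-is but slip through after the transform."""
    return [s for s in seeds if moderate(s) and not moderate(space_out(s))]

seeds = ["this is toxic speech", "a perfectly benign post"]
print(find_metamorphic_failures(seeds))  # the toxic seed evades the toy moderator
```

No ground-truth labels are needed: the original verdict serves as the oracle for the transformed input, which is what makes metamorphic testing practical against black-box commercial services.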
Related papers
- A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law grants creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
- Content Moderation on Social Media in the EU: Insights From the DSA Transparency Database [0.0]
The Digital Services Act (DSA) requires large social media platforms in the EU to provide clear and specific information whenever they restrict access to certain content.
Statements of Reasons (SoRs) are collected in the DSA Transparency Database to ensure transparency and scrutiny of content moderation decisions.
We empirically analyze 156 million SoRs within an observation period of two months to provide an early look at content moderation decisions of social media platforms in the EU.
arXiv Detail & Related papers (2023-12-07T16:56:19Z)
- Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus of 4.5 × 10^6 authored texts derived from public sources.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy.
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- Validating Multimedia Content Moderation Software via Semantic Fusion [16.322773343799575]
We introduce Semantic Fusion, a general, effective methodology for validating multimedia content moderation software.
We employ DUO to test five commercial content moderation software and two state-of-the-art models against three kinds of toxic content.
The results show that DUO achieves up to 100% error finding rate (EFR) when testing moderation software.
arXiv Detail & Related papers (2023-05-23T02:44:15Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model that explores contrastive learning in the domain of misinformation identification.
Our model shows superior performance in detecting non-matched image-text pairs when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes to the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- MTTM: Metamorphic Testing for Textual Content Moderation Software [11.759353169546646]
Social media platforms have been increasingly exploited to propagate toxic content.
Malicious users can evade moderation by changing only a few words in the toxic content.
We propose MTTM, a Metamorphic Testing framework for Textual content Moderation software.
arXiv Detail & Related papers (2023-02-11T14:44:39Z)
- Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage [64.78260098263489]
Twisting and camouflaging keywords are among the most used techniques to evade platform content moderation systems.
This article contributes to countering malicious information by developing multilingual tools that simulate and detect new methods of content moderation evasion.
arXiv Detail & Related papers (2022-12-27T16:08:49Z)
- WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans [2.4737119633827174]
In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms.
Social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content.
arXiv Detail & Related papers (2021-04-09T22:52:26Z)
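Several of the papers above (MTTM; Simulation and Detection of Word Camouflage) study evasion via small textual perturbations such as twisted or camouflaged keywords. A toy, stdlib-only sketch of the idea; the substitution table and the naive normalizer are illustrative assumptions, not the papers' actual multilingual tooling:

```python
# Sketch of keyword camouflage via leetspeak-style substitution, and the
# naive normalization a detector might apply. The mapping is illustrative
# only; real camouflage and real detectors are far more elaborate.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})
UNLEET = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "5": "s"})

def camouflage(word: str) -> str:
    """Replace common letters with look-alike digits to dodge keyword filters."""
    return word.translate(LEET)

def decamouflage(word: str) -> str:
    """Naive inverse mapping used by a detector to normalize text.
    Note: this would also corrupt legitimate digits, so a real detector
    needs context-sensitive normalization."""
    return word.translate(UNLEET)

print(camouflage("poison"))     # p0150n
print(decamouflage("p0150n"))   # poison
```

A keyword filter that matches only the canonical spelling misses the camouflaged form, which is exactly the evasion these papers simulate and then learn to detect.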
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed above and is not responsible for any consequences of its use.