Enabling Contextual Soft Moderation on Social Media through Contrastive Textual Deviation
- URL: http://arxiv.org/abs/2407.20910v1
- Date: Tue, 30 Jul 2024 15:37:05 GMT
- Title: Enabling Contextual Soft Moderation on Social Media through Contrastive Textual Deviation
- Authors: Pujan Paudel, Mohammad Hammas Saeed, Rebecca Auger, Chris Wells, Gianluca Stringhini
- Abstract summary: We propose to incorporate stance detection into existing automated soft-moderation pipelines.
We show that our approach can reduce contextual false positives from 20% to 2.1%.
- Score: 11.577310745082894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated soft moderation systems are unable to ascertain if a post supports or refutes a false claim, resulting in a large number of contextual false positives. This limits their effectiveness, for example undermining trust in health experts by adding warnings to their posts or resorting to vague warnings instead of granular fact-checks, which result in desensitizing users. In this paper, we propose to incorporate stance detection into existing automated soft-moderation pipelines, with the goal of ruling out contextual false positives and providing more precise recommendations for social media content that should receive warnings. We develop a textual deviation task called Contrastive Textual Deviation (CTD) and show that it outperforms existing stance detection approaches when applied to soft moderation. We then integrate CTD into the state-of-the-art system for automated soft moderation, Lambretta, showing that our approach can reduce contextual false positives from 20% to 2.1%, providing another important building block towards deploying reliable automated soft moderation tools on social media.
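The idea in the abstract, adding a stance check after a post has been matched to a false claim, can be illustrated with a minimal sketch. This is not the authors' CTD model or the Lambretta system: the `Candidate` type, the `filter_by_stance` helper, the label names, and the keyword-based `toy_stance` function are all hypothetical stand-ins, assumed here only to show where a stance detector would sit in such a pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical stance labels; the paper's CTD task may define its own label set.
SUPPORTS, REFUTES = "supports", "refutes"


@dataclass
class Candidate:
    """A post that an upstream matcher has linked to a known false claim."""
    post: str
    claim: str


def filter_by_stance(
    candidates: Iterable[Candidate],
    stance_fn: Callable[[str, str], str],
) -> List[Candidate]:
    """Keep only posts whose stance supports the false claim.

    Posts that refute the claim are the contextual false positives
    the abstract describes, so they are dropped before any warning.
    """
    return [c for c in candidates if stance_fn(c.post, c.claim) == SUPPORTS]


def toy_stance(post: str, claim: str) -> str:
    """Placeholder keyword heuristic, NOT a real stance model."""
    text = post.lower()
    refuting_cues = ("false", "debunked", "no evidence", "myth")
    if any(cue in text for cue in refuting_cues):
        return REFUTES
    return SUPPORTS


candidates = [
    Candidate("Garlic cures the flu, doctors won't tell you!", "Garlic cures the flu"),
    Candidate("No, garlic does not cure the flu; that claim is a myth.", "Garlic cures the flu"),
]

# Only the first post would be recommended for a warning label.
to_warn = filter_by_stance(candidates, toy_stance)
for c in to_warn:
    print(c.post)
```

In a real pipeline, `toy_stance` would be replaced by a trained stance or textual-deviation classifier; the filtering step itself stays the same.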
Related papers
- TaeBench: Improving Quality of Toxic Adversarial Examples [10.768188905349874]
This paper proposes an annotation pipeline for quality control of generated toxic adversarial examples (TAE).
We design model-based automated annotation and human-based quality verification to assess the quality requirements of TAE.
We show that TaeBench can effectively transfer-attack SOTA toxicity content moderation models and services.
arXiv Detail & Related papers (2024-10-08T00:14:27Z) - Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification [4.082799056366928]
Whistleblowing is essential for ensuring transparency and accountability in both public and private sectors.
Legal measures, such as the EU WBD, are limited in their scope and effectiveness.
Current text sanitization tools follow a one-size-fits-all approach and take an overly limited view of anonymity.
arXiv Detail & Related papers (2024-05-02T08:52:29Z) - Token-Level Adversarial Prompt Detection Based on Perplexity Measures
and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z) - Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News
Detection [50.07850264495737]
"Prompt-and-Align" (P&A) is a novel prompt-based paradigm for few-shot fake news detection.
We show that P&A sets a new state of the art for few-shot fake news detection performance by significant margins.
arXiv Detail & Related papers (2023-09-28T13:19:43Z) - Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - Explainable Abuse Detection as Intent Classification and Slot Filling [66.80201541759409]
We introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone.
We show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.
arXiv Detail & Related papers (2022-10-06T03:33:30Z) - Automated Detection of Doxing on Twitter [3.463438487417909]
Doxing refers to the practice of disclosing sensitive personal information about a person without their consent.
We propose and evaluate a set of approaches for automatically detecting second- and third-party disclosures on Twitter of sensitive private information.
arXiv Detail & Related papers (2022-02-02T05:04:34Z) - Repairing Adversarial Texts through Perturbation [11.65808514109149]
It is known that neural networks are subject to attacks through adversarial perturbations.
Adversarial perturbation is still possible after applying mitigation methods such as adversarial training.
We propose an approach to automatically repair adversarial texts at runtime.
arXiv Detail & Related papers (2021-12-29T03:57:02Z) - Sample-Efficient Safety Assurances using Conformal Prediction [57.92013073974406]
Early warning systems can provide alerts when an unsafe situation is imminent.
To reliably improve safety, these warning systems should have a provable false negative rate.
We present a framework that combines a statistical inference technique known as conformal prediction with a simulator of robot/environment dynamics.
arXiv Detail & Related papers (2021-09-28T23:00:30Z) - Towards Robust Speech-to-Text Adversarial Attack [78.5097679815944]
This paper introduces a novel adversarial algorithm for attacking the state-of-the-art speech-to-text systems, namely DeepSpeech, Kaldi, and Lingvo.
Our approach is based on developing an extension for the conventional distortion condition of the adversarial optimization formulation.
Minimizing over this metric, which measures the discrepancies between original and adversarial samples' distributions, contributes to crafting signals very close to the subspace of legitimate speech recordings.
arXiv Detail & Related papers (2021-03-15T01:51:41Z)