Explainable Abuse Detection as Intent Classification and Slot Filling
- URL: http://arxiv.org/abs/2210.02659v1
- Date: Thu, 6 Oct 2022 03:33:30 GMT
- Title: Explainable Abuse Detection as Intent Classification and Slot Filling
- Authors: Agostina Calabrese, Björn Ross, Mirella Lapata
- Abstract summary: We introduce the concept of policy-aware abuse detection, abandoning the unrealistic expectation that systems can reliably learn which phenomena constitute abuse from inspecting the data alone.
We show how architectures for intent classification and slot filling can be used for abuse detection, while providing a rationale for model decisions.
- Score: 66.80201541759409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To proactively offer social media users a safe online experience, there is a
need for systems that can detect harmful posts and promptly alert platform
moderators. In order to guarantee the enforcement of a consistent policy,
moderators are provided with detailed guidelines. In contrast, most
state-of-the-art models learn what abuse is from labelled examples and as a
result base their predictions on spurious cues, such as the presence of group
identifiers, which can be unreliable. In this work we introduce the concept of
policy-aware abuse detection, abandoning the unrealistic expectation that
systems can reliably learn which phenomena constitute abuse from inspecting the
data alone. We propose a machine-friendly representation of the policy that
moderators wish to enforce, by breaking it down into a collection of intents
and slots. We collect and annotate a dataset of 3,535 English posts with such
slots, and show how architectures for intent classification and slot filling
can be used for abuse detection, while providing a rationale for model
decisions.
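The abstract's core idea, representing a moderation policy as intents with typed slots so that a prediction comes with its own rationale, can be illustrated with a toy sketch. Everything below (rule names, slot names, lexicons, the keyword-based slot filler) is an invented stand-in for the paper's actual schema and neural intent-classification/slot-filling architecture:

```python
from dataclasses import dataclass

# Toy policy representation: each rule moderators enforce becomes an
# "intent", and the evidence the rule requires becomes typed "slots".
# Rule names, slot names, and keyword lists are illustrative only.

@dataclass
class PolicyRule:
    intent: str        # e.g. "dehumanisation"
    slot_types: list   # evidence spans the rule needs

@dataclass
class Prediction:
    intent: str
    slots: dict        # slot type -> text span (the rationale)

# A trivial lexicon-based slot filler standing in for a learned model.
TARGET_TERMS = {"immigrants", "women"}
COMPARISON_TERMS = {"vermin", "parasites"}

RULES = [PolicyRule("dehumanisation", ["target", "dehumanising_comparison"])]

def detect(post: str) -> Prediction:
    tokens = post.lower().replace(".", "").split()
    slots = {}
    for tok in tokens:
        if tok in TARGET_TERMS:
            slots["target"] = tok
        elif tok in COMPARISON_TERMS:
            slots["dehumanising_comparison"] = tok
    for rule in RULES:
        # A rule fires only when every slot it requires is filled, so the
        # filled slots double as a rationale for the decision.
        if all(s in slots for s in rule.slot_types):
            return Prediction(rule.intent, {s: slots[s] for s in rule.slot_types})
    return Prediction("non_abusive", {})

print(detect("Immigrants are vermin."))    # rule fires: both slots filled
print(detect("Immigrants arrived today."))  # target alone does not fire the rule
```

Note how a mention of a group identifier alone fills only one slot and so does not trigger a prediction of abuse, which is exactly the spurious cue the policy-aware framing is meant to avoid.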
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation [86.05704141217036]
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs.
We introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection.
arXiv Detail & Related papers (2024-06-28T17:05:46Z)
- "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by retrieving from external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
- The Unappreciated Role of Intent in Algorithmic Moderation of Social Media Content [2.2618341648062477]
This paper examines the role of intent in content moderation systems.
We review state of the art detection models and benchmark training datasets for online abuse to assess their awareness and ability to capture intent.
arXiv Detail & Related papers (2024-05-17T18:05:13Z)
- Cream Skimming the Underground: Identifying Relevant Information Points from Online Forums [0.16252563723817934]
This paper proposes a machine learning-based approach for detecting the exploitation of vulnerabilities in the wild by monitoring underground hacking forums.
We develop a supervised machine learning model that can filter threads citing CVEs and label them as Proof-of-Concept, Weaponization, or Exploitation.
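The pipeline this summary describes, filtering threads that cite a CVE and then labeling them, can be sketched with a toy keyword-based classifier. The keyword lists, label ordering, and default bucket below are illustrative assumptions, not the paper's trained model:

```python
import re

# Toy stand-in for the supervised classifier: label forum threads that
# cite a CVE as Proof-of-Concept, Weaponization, or Exploitation.
# Keyword lists are invented for illustration.
LABEL_KEYWORDS = {
    "Exploitation": {"compromised", "pwned", "in the wild"},
    "Weaponization": {"payload", "builder", "crypter"},
    "Proof-of-Concept": {"poc", "demo", "writeup"},
}

CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)

def label_thread(text: str):
    """Return (cve_ids, label), or None when the thread cites no CVE."""
    cves = CVE_RE.findall(text)
    if not cves:
        return None  # thread filtered out: no CVE citation
    lowered = text.lower()
    # First matching label wins, checked from most to least severe.
    for label, words in LABEL_KEYWORDS.items():
        if any(w in lowered for w in words):
            return cves, label
    return cves, "Proof-of-Concept"  # default bucket

print(label_thread("New PoC demo for CVE-2021-44228"))
# → (['CVE-2021-44228'], 'Proof-of-Concept')
```

In the actual system the keyword matching would be replaced by the learned model; the sketch only shows the filter-then-label structure of the task.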
arXiv Detail & Related papers (2023-08-03T16:52:42Z)
- Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection [13.772240348963303]
Rule By Example (RBE) is a novel exemplar-based contrastive learning approach for learning from logical rules for the task of textual content moderation.
RBE is capable of providing rule-grounded predictions, allowing for more explainable and customizable predictions compared to typical deep learning-based approaches.
arXiv Detail & Related papers (2023-07-24T16:55:37Z)
- Measuring Re-identification Risk [72.6715574626418]
We present a new theoretical framework to measure re-identification risk in compact user representations.
Our framework formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation.
We show how our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising.
arXiv Detail & Related papers (2023-04-12T16:27:36Z)
- Canary in a Coalmine: Better Membership Inference with Ensembled Adversarial Queries [53.222218035435006]
We use adversarial tools to optimize for queries that are discriminative and diverse.
Our improvements achieve significantly more accurate membership inference than existing methods.
arXiv Detail & Related papers (2022-10-19T17:46:50Z)
- Semantic Novelty Detection via Relational Reasoning [17.660958043781154]
We propose a novel representation learning paradigm based on relational reasoning.
Our experiments show that this knowledge is directly transferable to a wide range of scenarios.
It can be exploited as a plug-and-play module to convert closed-set recognition models into reliable open-set ones.
arXiv Detail & Related papers (2022-07-18T15:49:27Z)
- Pattern Learning for Detecting Defect Reports and Improvement Requests in App Reviews [4.460358746823561]
In this study, we follow novel approaches that target this absence of actionable insights by classifying reviews as defect reports and requests for improvement.
We employ a supervised system that is capable of learning lexico-semantic patterns through genetic programming.
We show that the automatically learned patterns outperform the manually created ones.
arXiv Detail & Related papers (2020-04-19T08:13:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.