On Cross-Dataset Generalization in Automatic Detection of Online Abuse
- URL: http://arxiv.org/abs/2010.07414v3
- Date: Wed, 19 May 2021 18:47:03 GMT
- Title: On Cross-Dataset Generalization in Automatic Detection of Online Abuse
- Authors: Isar Nejadgholi and Svetlana Kiritchenko
- Abstract summary: We show that the benign examples in the Wikipedia Detox dataset are biased towards platform-specific topics.
We identify these examples using unsupervised topic modeling and manual inspection of topics' keywords.
For a robust dataset design, we suggest applying inexpensive unsupervised methods to inspect the collected data and downsize the non-generalizable content.
- Score: 7.163723138100273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: NLP research has attained high performance in abusive language detection as
a supervised classification task. While in research settings, training and test
datasets are usually obtained from similar data samples, in practice systems
are often applied to data that differ from the training set in topic and
class distributions. Also, the ambiguity in class definitions inherent in this
task aggravates the discrepancies between source and target datasets. We
explore the topic bias and the task formulation bias in cross-dataset
generalization. We show that the benign examples in the Wikipedia Detox dataset
are biased towards platform-specific topics. We identify these examples using
unsupervised topic modeling and manual inspection of topics' keywords. Removing
these topics increases cross-dataset generalization, without reducing in-domain
classification performance. For a robust dataset design, we suggest applying
inexpensive unsupervised methods to inspect the collected data and downsize the
non-generalizable content before manually annotating for class labels.
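The inspect-and-downsize step described in the abstract can be illustrated with a minimal sketch. It assumes that some unsupervised topic model has already assigned a dominant topic to each example (the `topic_ids` list is hypothetical toy data, not output from the paper's model) and that a person has flagged platform-specific topics after reading their keywords. The function names, the toy examples, and the choice to drop only benign examples are illustrative assumptions, not the authors' code.

```python
from collections import Counter, defaultdict


def top_keywords(docs, topic_ids, k=5):
    """Return the k most frequent tokens per topic, for manual inspection."""
    counts = defaultdict(Counter)
    for text, topic in zip(docs, topic_ids):
        counts[topic].update(text.lower().split())
    return {topic: [w for w, _ in c.most_common(k)] for topic, c in counts.items()}


def drop_flagged_benign(docs, labels, topic_ids, flagged_topics, benign_label=0):
    """Drop benign examples whose dominant topic was flagged as platform-specific.

    Abusive examples are kept regardless of topic (an assumption of this sketch).
    """
    kept = []
    for text, label, topic in zip(docs, labels, topic_ids):
        if label == benign_label and topic in flagged_topics:
            continue
        kept.append((text, label))
    return kept


# Hypothetical toy data: label 0 = benign, 1 = abusive.
docs = [
    "please revert the wikipedia article edit",
    "wikipedia talk page edit needs a citation",
    "you are an idiot",
    "thanks for the kind reply",
    "this wikipedia edit is garbage you moron",
]
labels = [0, 0, 1, 0, 1]
topic_ids = [0, 0, 1, 1, 0]  # assumed output of a topic model

keywords = top_keywords(docs, topic_ids, k=2)  # inspect these by hand
filtered = drop_flagged_benign(docs, labels, topic_ids, flagged_topics={0})
```

Here topic 0 is dominated by the tokens "wikipedia" and "edit", so a reviewer would flag it as platform-specific; the filter then removes the two benign examples from that topic while keeping the abusive one.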
Related papers
- Towards Weakly-Supervised Hate Speech Classification Across Datasets [47.101942709219784]
We show the effectiveness of a state-of-the-art weakly-supervised text classification model in various in-dataset and cross-dataset settings.
We also conduct an in-depth quantitative and qualitative analysis of the source of poor generalizability of HS classification models.
arXiv Detail & Related papers (2023-05-04T08:15:40Z)
- Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics [3.9627732117855414]
We focus on providing a unified and efficient framework for Metadata Archaeology.
We curate different subsets of data that might exist in a dataset.
We leverage differences in learning dynamics between these probe suites to infer metadata of interest.
arXiv Detail & Related papers (2022-09-20T21:52:39Z)
- Automatic universal taxonomies for multi-domain semantic segmentation [1.4364491422470593]
Training semantic segmentation models on multiple datasets has sparked a lot of recent interest in the computer vision community.
Established datasets have mutually incompatible labels, which disrupt principled inference in the wild.
We address this issue by automatic construction of universal taxonomies through iterative dataset integration.
arXiv Detail & Related papers (2022-07-18T08:53:17Z)
- Identifying the Context Shift between Test Benchmarks and Production Data [1.2259552039796024]
There exists a performance gap between machine learning models' accuracy on dataset benchmarks and real-world production data.
We outline two methods for identifying changes in context that lead to distribution shifts and model prediction errors.
We present two case studies to highlight the implicit assumptions underlying applied machine learning models that tend to lead to errors.
arXiv Detail & Related papers (2022-07-03T14:54:54Z)
- Classification of datasets with imputed missing values: does imputation quality matter? [2.7646249774183]
Classifying samples in incomplete datasets is non-trivial.
We demonstrate how the commonly used measures for assessing quality are flawed.
We propose a new class of discrepancy scores which focus on how well the method recreates the overall distribution of the data.
arXiv Detail & Related papers (2022-06-16T22:58:03Z)
- Learning Debiased and Disentangled Representations for Semantic Segmentation [52.35766945827972]
We propose a model-agnostic training scheme for semantic segmentation.
By randomly eliminating certain class information in each training iteration, we effectively reduce feature dependencies among classes.
Models trained with our approach demonstrate strong results on multiple semantic segmentation benchmarks.
arXiv Detail & Related papers (2021-10-31T16:15:09Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z)
- Summary-Source Proposition-level Alignment: Task, Datasets and Supervised Baseline [94.0601799665342]
Aligning sentences in a reference summary with their counterparts in source documents has been shown to be a useful auxiliary summarization task.
We propose establishing summary-source alignment as an explicit task, while introducing two major novelties.
We create a novel training dataset for proposition-level alignment, derived automatically from available summarization evaluation data.
We present a supervised proposition alignment baseline model, showing improved alignment-quality over the unsupervised approach.
arXiv Detail & Related papers (2020-09-01T17:27:12Z)
- Automatically Discovering and Learning New Visual Categories with Ranking Statistics [145.89790963544314]
We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes.
We learn a general-purpose clustering model and use the latter to identify the new classes in the unlabelled data.
We evaluate our approach on standard classification benchmarks and outperform current methods for novel category discovery by a significant margin.
arXiv Detail & Related papers (2020-02-13T18:53:32Z)