A Guide to Misinformation Detection Datasets
- URL: http://arxiv.org/abs/2411.05060v1
- Date: Thu, 07 Nov 2024 18:47:39 GMT
- Title: A Guide to Misinformation Detection Datasets
- Authors: Camille Thibault, Gabrielle Peloquin-Skulski, Jacob-Junqi Tian, Florence Laflamme, Yuxiang Guan, Reihaneh Rabbany, Jean-François Godbout, Kellin Pelrine
- Abstract summary: This guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations.
All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.
- Score: 5.673951146506489
- Abstract: Misinformation is a complex societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this problem, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of all of the 36 datasets that consist of statements or claims. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could result in misleading and non-generalizable results, such as insufficient label quality, spurious correlations, or political bias. We further provide state-of-the-art baselines on all these datasets, but show that regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. We discuss alternatives to mitigate this problem. Overall, this guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations, ultimately improving research in misinformation detection. All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.
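The paper's warnings about label quality and spurious correlations suggest a quick sanity check any dataset user can run: a shallow lexical probe. Below is a minimal sketch assuming a hypothetical claims.csv with claim and label columns (placeholders, not artifacts of the paper); if a bag-of-words classifier scores far above chance, the labels may be predictable from surface cues rather than actual veracity. This illustrates the idea only and is not the authors' evaluation code.
```python
# Minimal sketch of a shallow "bias probe" for a claims dataset.
# The file name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("claims.csv")  # columns: claim (text), label (0/1)
X_train, X_test, y_train, y_test = train_test_split(
    df["claim"], df["label"], test_size=0.2, random_state=0, stratify=df["label"]
)

# A bag-of-words model has no real language understanding, so accuracy far
# above chance suggests surface cues (spurious correlations) in the data.
vec = CountVectorizer(max_features=5000)
probe = LogisticRegression(max_iter=1000)
probe.fit(vec.fit_transform(X_train), y_train)
acc = accuracy_score(y_test, probe.predict(vec.transform(X_test)))
print(f"shallow-probe accuracy: {acc:.3f}")
```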
Related papers
- Attribute-Based Semantic Type Detection and Data Quality Assessment [0.5735035463793008]
This research introduces an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment.
By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system.
A comparative analysis with Sherlock, a state-of-the-art Semantic Type Detection system, shows the advantages of our approach.
arXiv Detail & Related papers (2024-10-04T09:22:44Z)
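As a rough illustration of the pattern described above, the sketch below assigns semantic types from attribute labels expanded via an abbreviations dictionary, falling back to format rules on sample values. The dictionary entries and regexes are invented for illustration and are not the paper's actual resources.
```python
# Hypothetical rule-based semantic type detection from attribute labels.
import re

ABBREVIATIONS = {"dob": "date of birth", "addr": "address", "tel": "telephone"}
FORMAT_RULES = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}$"), "date"),
    (re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"), "email"),
    (re.compile(r"^\+?\d[\d\s-]{6,}$"), "phone"),
]

def semantic_type(label: str, sample_value: str) -> str:
    # Expand abbreviated attribute labels, then fall back to format rules.
    expanded = ABBREVIATIONS.get(label.lower(), label.lower())
    if "date" in expanded:
        return "date"
    if "address" in expanded:
        return "address"
    for pattern, type_name in FORMAT_RULES:
        if pattern.match(sample_value):
            return type_name
    return "unknown"

print(semantic_type("dob", "1990-05-17"))   # -> date
print(semantic_type("contact", "a@b.com"))  # -> email
```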
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets [19.128392861461297]
We conduct meticulous analyses of two popular dermatological image datasets: DermaMNIST and Fitzpatrick17k.
We uncover data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets.
arXiv Detail & Related papers (2024-01-25T20:29:01Z)
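Audits of image datasets often begin with simple integrity checks. As one illustrative example (not necessarily the specific issues examined in that paper), the sketch below flags exact duplicates and train/test leakage by hashing image bytes; the folder paths are hypothetical.
```python
# Hypothetical duplicate / leakage check via byte-level hashing.
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hashes(folder: str) -> dict[str, list[Path]]:
    groups = defaultdict(list)
    for path in Path(folder).rglob("*.png"):
        groups[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
    return groups

train = file_hashes("data/train")  # placeholder paths
test = file_hashes("data/test")

dupes = {h: paths for h, paths in train.items() if len(paths) > 1}
leaks = set(train) & set(test)  # identical images in both splits
print(f"{len(dupes)} duplicated train images, {len(leaks)} train/test leaks")
```
Note this only catches byte-identical files; near-duplicates would require perceptual hashing or embedding similarity.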
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
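As a toy stand-in for the idea of model-driven meta-information (not infoVerse's actual API), the sketch below characterizes each sample by two model signals, confidence and predictive entropy, and selects the most uncertain samples for annotation.
```python
# Toy meta-information space: (confidence, entropy) per sample.
import numpy as np

def meta_features(probs: np.ndarray) -> np.ndarray:
    # probs: (n_samples, n_classes) predicted probabilities from any model.
    confidence = probs.max(axis=1)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.stack([confidence, entropy], axis=1)

rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=100)  # fake predictions
space = meta_features(probs)

# e.g., pick the 10 highest-entropy samples for annotation.
most_uncertain = np.argsort(space[:, 1])[-10:]
print(most_uncertain)
```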
- A Survey of Dataset Refinement for Problems in Computer Vision Datasets [11.45536223418548]
Large-scale datasets have played a crucial role in the advancement of computer vision.
They often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs.
Various data-centric solutions have been proposed to address these problems.
They improve the quality of datasets by re-organizing them, which we call dataset refinement.
arXiv Detail & Related papers (2022-10-21T03:58:43Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and the scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Assessing the Quality of the Datasets by Identifying Mislabeled Samples [14.881597737762316]
We propose a novel statistic -- noise score -- as a measure for the quality of each data point to identify mislabeled samples.
In our work, we use the representations derived by the inference network of a data quality supervised variational autoencoder (AQUAVS).
We validate our proposed statistic through experimentation by corrupting MNIST, FashionMNIST, and CIFAR10/100 datasets.
arXiv Detail & Related papers (2021-09-10T17:14:09Z)
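AQUAVS's exact statistic is not reproduced here; the sketch below captures the general idea of a noise score, with distance to the assigned class's centroid in an embedding space as a hypothetical stand-in: samples far from their labeled class look potentially mislabeled.
```python
# Hedged sketch of a noise-score-like statistic over embeddings.
import numpy as np

def noise_scores(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    scores = np.zeros(len(labels))
    for c in np.unique(labels):
        mask = labels == c
        centroid = embeddings[mask].mean(axis=0)
        # Distance from each sample to its labeled class centroid;
        # higher means more likely mislabeled under this heuristic.
        scores[mask] = np.linalg.norm(embeddings[mask] - centroid, axis=1)
    return scores

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 16))       # stand-in embeddings
labels = rng.integers(0, 5, size=200)  # stand-in labels
suspects = np.argsort(noise_scores(emb, labels))[-10:]
print(suspects)  # indices of the 10 most suspicious samples
```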
- Statistical Learning to Operationalize a Domain Agnostic Data Quality Scoring [8.864453148536061]
The study provides an automated platform that takes an incoming dataset and its metadata and produces a data quality (DQ) score, report, and label.
These results would be useful to data scientists, as the quality label instills confidence in the data before it is deployed in a practical application.
arXiv Detail & Related papers (2021-08-16T12:20:57Z)
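As a hypothetical illustration of such a score (the platform's actual dimensions and weights are not reproduced here), the sketch below aggregates two simple measures, completeness and row uniqueness, with arbitrary equal weights.
```python
# Hypothetical domain-agnostic DQ score from two simple dimensions.
import pandas as pd

def dq_score(df: pd.DataFrame) -> dict[str, float]:
    completeness = 1.0 - df.isna().to_numpy().mean()  # non-missing cells
    uniqueness = 1.0 - df.duplicated().mean()         # non-duplicate rows
    overall = 0.5 * completeness + 0.5 * uniqueness   # arbitrary weights
    return {"completeness": completeness, "uniqueness": uniqueness,
            "score": overall}

df = pd.DataFrame({"a": [1, 1, None, 4], "b": ["x", "x", "y", "z"]})
print(dq_score(df))
```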
- Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z)
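The suggested remedy is straightforward to sketch: hold out entire sources (and optionally a date range) so that no site appears in both train and test. The column names below are hypothetical placeholders.
```python
# Minimal sketch of a source-disjoint ("clean") train/test split.
import pandas as pd

def clean_split(df: pd.DataFrame, test_frac: float = 0.2):
    # Shuffle the unique sources, then hold out whole sources for testing.
    sources = df["source"].drop_duplicates().sample(frac=1.0, random_state=0)
    n_test = max(1, int(len(sources) * test_frac))
    test_sources = set(sources.iloc[:n_test])
    test = df[df["source"].isin(test_sources)]
    train = df[~df["source"].isin(test_sources)]
    # A date cutoff (e.g., train before 2020, test after) could be added too.
    return train, test

df = pd.DataFrame({"text": ["a", "b", "c", "d"], "label": [0, 1, 0, 1],
                   "source": ["siteA", "siteA", "siteB", "siteC"]})
train, test = clean_split(df)
print(sorted(set(train["source"])), sorted(set(test["source"])))
```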
- Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z)
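A small sketch of the kind of artifact statistic this line of work examines: for each token, how far p(label | token) deviates from the label base rate on toy data. Strong deviations for trivial tokens indicate spurious cues a model could exploit.
```python
# Toy per-token label-correlation check.
from collections import Counter

texts = ["the sky is green", "water is wet", "pigs can fly", "fire is hot"]
labels = [0, 1, 0, 1]  # toy data: 1 = true, 0 = false

base_rate = sum(labels) / len(labels)
token_counts, token_pos = Counter(), Counter()
for text, y in zip(texts, labels):
    for tok in set(text.split()):
        token_counts[tok] += 1
        token_pos[tok] += y

for tok in sorted(token_counts):
    p = token_pos[tok] / token_counts[tok]
    print(f"{tok:>6}: p(true|token)={p:.2f} vs base rate {base_rate:.2f}")
```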
- GraspNet: A Large-Scale Clustered and Densely Annotated Dataset for Object Grasping [49.777649953381676]
We contribute a large-scale grasp pose detection dataset with a unified evaluation system.
Our dataset contains 87,040 RGBD images with over 370 million grasp poses.
arXiv Detail & Related papers (2019-12-31T18:15:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.