Small data problems in political research: a critical replication study
- URL: http://arxiv.org/abs/2109.12911v1
- Date: Mon, 27 Sep 2021 09:55:58 GMT
- Title: Small data problems in political research: a critical replication study
- Authors: Hugo de Vos, Suzan Verberne
- Abstract summary: We show that the small data causes the classification model to be highly sensitive to variations in the random train-test split.
We also show that the applied preprocessing causes the data to be extremely sparse.
Based on our findings, we argue that A&W's conclusions regarding the automated classification of organizational reputation tweets cannot be maintained.
- Score: 5.698280399449707
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In an often-cited 2019 paper on the use of machine learning in political
research, Anastasopoulos & Whitford (A&W) propose a text classification method
for tweets related to organizational reputation. The aim of their paper was to
provide a 'guide to practice' for public administration scholars and
practitioners on the use of machine learning. In the current paper we follow up
on that work with a replication of A&W's experiments and additional analyses on
model stability and the effects of preprocessing, both in relation to the small
data size. We show that (1) the small data causes the classification model to
be highly sensitive to variations in the random train-test split, and that (2)
the applied preprocessing causes the data to be extremely sparse, with the
majority of items in the data having at most two non-zero lexical features.
With additional experiments in which we vary the steps of the preprocessing
pipeline, we show that the small data size keeps causing problems, irrespective
of the preprocessing choices. Based on our findings, we argue that A&W's
conclusions regarding the automated classification of organizational reputation
tweets -- either substantive or methodological -- cannot be maintained, and
that a larger data set for training and more careful validation are required.
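The two findings can be illustrated with a short, stdlib-only Python sketch. Everything in it is hypothetical: the toy "tweet" corpus, the two reputation labels, the token-overlap classifier, and the document-frequency threshold are illustrative stand-ins, not A&W's actual data or pipeline. The sketch shows (1) how accuracy can swing between random train-test splits when the data set is tiny, and (2) how pruning rare terms leaves most items with very few non-zero lexical features.

```python
import random
from collections import Counter

# Hypothetical toy corpus standing in for the reputation tweets; the labels
# and texts are invented for illustration only.
TWEETS = [
    ("agency announces new performance results", "performative"),
    ("great performance numbers from the agency", "performative"),
    ("performance targets exceeded this quarter", "performative"),
    ("quarterly results show strong performance", "performative"),
    ("agency performance review published today", "performative"),
    ("officials followed proper legal procedure", "procedural"),
    ("the hearing followed due procedure", "procedural"),
    ("procedure for appeals explained by officials", "procedural"),
    ("legal procedure questioned at the hearing", "procedural"),
    ("officials cite procedure in appeals case", "procedural"),
    ("agency procedure review announced", "procedural"),
    ("performance of the procedure criticized", "performative"),
]

def tokens(text):
    return text.split()

def predict(train, text):
    # Trivial classifier: vote by token overlap with each class's training
    # texts (a stand-in for a real model, not A&W's method).
    votes = Counter()
    for t, label in train:
        votes[label] += len(set(tokens(t)) & set(tokens(text)))
    return votes.most_common(1)[0][0]

def split_accuracy(seed, test_size=3):
    # One random train-test split, controlled by the seed.
    rng = random.Random(seed)
    data = TWEETS[:]
    rng.shuffle(data)
    test, train = data[:test_size], data[test_size:]
    correct = sum(predict(train, t) == label for t, label in test)
    return correct / len(test)

# Diagnostic 1: spread of accuracy over many random splits.
accs = [split_accuracy(seed) for seed in range(50)]
print(f"accuracy over 50 splits: min={min(accs):.2f} max={max(accs):.2f}")

# Diagnostic 2: sparsity after pruning rare terms (a crude stand-in for a
# heavy preprocessing pipeline). Keep only tokens occurring in >= 3 tweets.
df = Counter(tok for t, _ in TWEETS for tok in set(tokens(t)))
vocab = {tok for tok, n in df.items() if n >= 3}
nonzero = [len(set(tokens(t)) & vocab) for t, _ in TWEETS]
share = sum(n <= 2 for n in nonzero) / len(nonzero)
print(f"share of tweets with <=2 non-zero features: {share:.2f}")
```

On a corpus this small, both effects compound: the handful of surviving vocabulary terms carries almost all the signal, so which items land in the test set largely determines the reported accuracy.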
Related papers
- A Data-Centric Perspective on Evaluating Machine Learning Models for Tabular Data [9.57464542357693]
This paper demonstrates that model-centric evaluations are biased, as real-world modeling pipelines often require dataset-specific preprocessing and feature engineering.
We select 10 relevant datasets from Kaggle competitions and implement expert-level preprocessing pipelines for each dataset.
After dataset-specific feature engineering, model rankings change considerably, performance differences decrease, and the importance of model selection reduces.
arXiv Detail & Related papers (2024-07-02T09:54:39Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Are fairness metric scores enough to assess discrimination biases in machine learning? [4.073786857780967]
We focus on the Bios dataset, and our learning task is to predict the occupation of individuals, based on their biography.
We address an important limitation of theoretical discussions dealing with group-wise fairness metrics: they focus on large datasets.
We then question how reliable different popular measures of bias are when the training set is only just large enough to learn reasonably accurate predictions.
arXiv Detail & Related papers (2023-06-08T15:56:57Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Representation Bias in Data: A Survey on Identification and Resolution Techniques [26.142021257838564]
Data-driven algorithms are only as good as the data they work with, while data sets, especially social data, often fail to represent minorities adequately.
Representation Bias in data can happen due to various reasons ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods.
This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how the data is consumed later.
arXiv Detail & Related papers (2022-03-22T16:30:22Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high likelihood of gaining citations from those with a low likelihood.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.