Multi-characteristic Subject Selection from Biased Datasets
- URL: http://arxiv.org/abs/2012.10311v1
- Date: Fri, 18 Dec 2020 15:55:27 GMT
- Title: Multi-characteristic Subject Selection from Biased Datasets
- Authors: Tahereh Arabghalizi, Alexandros Labrinidis
- Abstract summary: We present a constrained optimization-based method that finds the best possible sampling fractions for the different population subgroups.
Our results show that our proposed method outperforms the baselines for all problem variations by up to 90%.
- Score: 79.82881947891589
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Subject selection plays a critical role in experimental studies, especially
ones with human subjects. Anecdotal evidence suggests that many such studies,
done at or near university campus settings, suffer from selection bias, i.e.,
the too-many-college-kids-as-subjects problem. Unfortunately, traditional
sampling techniques, when applied over biased data, will typically return
biased results. In this paper, we tackle the problem of multi-characteristic
subject selection from biased datasets. We present a constrained
optimization-based method that finds the best possible sampling fractions for
the different population subgroups, based on the desired sampling fractions
provided by the researcher running the subject selection. We perform an
extensive experimental study, using a variety of real datasets. Our results
show that our proposed method outperforms the baselines for all problem
variations by up to 90%.
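The abstract describes a constrained optimization over per-subgroup sampling fractions. Below is a minimal sketch of that idea, assuming a quadratic deviation objective, hypothetical subgroup counts, and SciPy's SLSQP solver; none of these choices are taken from the paper itself.
```python
# Hypothetical sketch: pick per-subgroup sampling fractions as close as possible
# to the researcher's desired fractions, subject to what the biased pool can supply.
# The subgroup counts, objective, and solver are illustrative assumptions,
# not the paper's exact formulation.
import numpy as np
from scipy.optimize import minimize

desired = np.array([0.25, 0.25, 0.25, 0.25])   # researcher-specified fractions per subgroup
available = np.array([900, 400, 60, 40])       # subjects per subgroup in the biased pool
total_needed = 200                             # overall sample size

def deviation(f):
    # squared distance between achievable and desired sampling fractions
    return np.sum((f - desired) ** 2)

constraints = [
    {"type": "eq", "fun": lambda f: np.sum(f) - 1.0},                 # fractions sum to 1
    {"type": "ineq", "fun": lambda f: available - f * total_needed},  # cannot exceed supply
]
bounds = [(0.0, 1.0)] * len(desired)

result = minimize(deviation, x0=desired, bounds=bounds,
                  constraints=constraints, method="SLSQP")
print("feasible sampling fractions:", np.round(result.x, 3))
```
In this toy setup the fourth subgroup is undersupplied (40 subjects versus the 50 the desired fractions would require), so the solver shifts its fraction down and redistributes the remainder, which is the kind of trade-off the abstract's method is meant to resolve.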
Related papers
- Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process.
Vanilla reference-model-free methods independently score and select data in a sample-wise manner.
We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
arXiv Detail & Related papers (2024-06-07T12:12:20Z) - From Random to Informed Data Selection: A Diversity-Based Approach to
Optimize Human Annotation and Few-Shot Learning [38.30983556062276]
A major challenge in Natural Language Processing is obtaining annotated data for supervised learning.
Crowdsourcing introduces issues related to the annotator's experience, consistency, and biases.
This paper contributes an automatic and informed data selection architecture to build a small dataset for few-shot learning.
arXiv Detail & Related papers (2024-01-24T04:57:32Z) - Hybrid Sample Synthesis-based Debiasing of Classifier in Limited Data
Setting [5.837881923712393]
This paper focuses on a more practical setting with no prior information about the bias.
In this setting, there are a large number of bias-aligned samples that cause the model to produce biased predictions.
If the training data is limited, the influence of the bias-aligned samples may become even stronger on the model predictions.
arXiv Detail & Related papers (2023-12-13T17:04:16Z) - Approximating Counterfactual Bounds while Fusing Observational, Biased
and Randomised Data Sources [64.96984404868411]
We address the problem of integrating data from multiple, possibly biased, observational and interventional studies.
We show that the likelihood of the available data has no local maxima.
We then show how the same approach can address the general case of multiple datasets.
arXiv Detail & Related papers (2023-07-31T11:28:24Z) - SMoA: Sparse Mixture of Adapters to Mitigate Multiple Dataset Biases [27.56143777363971]
We propose a new debiasing method Sparse Mixture-of-Adapters (SMoA), which can mitigate multiple dataset biases effectively and efficiently.
Experiments on Natural Language Inference and Paraphrase Identification tasks demonstrate that SMoA outperforms full-finetuning, adapter tuning baselines, and prior strong debiasing methods.
arXiv Detail & Related papers (2023-02-28T08:47:20Z) - Feature-Level Debiased Natural Language Understanding [86.8751772146264]
Existing natural language understanding (NLU) models often rely on dataset biases to achieve high performance on specific datasets.
We propose debiasing contrastive learning (DCT) to mitigate biased latent features and account for the dynamic nature of bias, which prior methods neglect.
DCT outperforms state-of-the-art baselines on out-of-distribution datasets while maintaining in-distribution performance.
arXiv Detail & Related papers (2022-12-11T06:16:14Z) - Representation Bias in Data: A Survey on Identification and Resolution
Techniques [26.142021257838564]
Data-driven algorithms are only as good as the data they work with, while data sets, especially social data, often fail to represent minorities adequately.
Representation Bias in data can happen due to various reasons ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods.
This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how the data is consumed later.
arXiv Detail & Related papers (2022-03-22T16:30:22Z) - Source data selection for out-of-domain generalization [0.76146285961466]
Poor selection of a source dataset can lead to poor performance on the target.
We propose two source selection methods, based on multi-armed bandit theory and random search.
Our proposals can be viewed as diagnostics for the existence of reweighted source subsamples that perform better than a random selection of the available samples.
arXiv Detail & Related papers (2022-02-04T14:37:31Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class (a minimal oversampling sketch appears after this list).
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Improving Multi-Turn Response Selection Models with Complementary
Last-Utterance Selection by Instance Weighting [84.9716460244444]
We consider utilizing the underlying correlation in the data resource itself to derive different kinds of supervision signals.
We conduct extensive experiments in two public datasets and obtain significant improvement in both datasets.
arXiv Detail & Related papers (2020-02-18T06:29:01Z)
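The imbalanced-data entry above mentions oversampling and undersampling. The sketch below illustrates plain random oversampling of the minority class on synthetic data; it is an illustrative assumption, not the strategy-selection method the cited paper proposes.
```python
# Minimal illustration of random oversampling on a synthetic imbalanced dataset.
# Class 1 (minority) examples are duplicated until both classes have equal counts.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(3, 1, (5, 2))])
y = np.array([0] * 95 + [1] * 5)

minority_idx = np.flatnonzero(y == 1)
n_extra = (y == 0).sum() - len(minority_idx)          # how many copies are needed
extra = rng.choice(minority_idx, size=n_extra, replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))  # both classes now have 95 examples
```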
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.