balance -- a Python package for balancing biased data samples
- URL: http://arxiv.org/abs/2307.06024v2
- Date: Thu, 13 Jul 2023 09:48:45 GMT
- Title: balance -- a Python package for balancing biased data samples
- Authors: Tal Sarig, Tal Galili, Roee Eilat
- Abstract summary: We present balance, an open-source Python package by Meta, offering a simple workflow for analyzing and adjusting biased data samples.
The package provides a simple API that can be used by researchers and data scientists from a wide range of fields on a variety of data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Surveys are an important research tool, providing unique measurements on
subjective experiences such as sentiment and opinions that cannot be measured
by other means. However, because survey data is collected from a self-selected
group of participants, directly inferring insights from it to a population of
interest, or training ML models on such data, can lead to erroneous estimates
or under-performing models. In this paper we present balance, an open-source
Python package by Meta, offering a simple workflow for analyzing and adjusting
biased data samples with respect to a population of interest.
The balance workflow includes three steps: understanding the initial bias in
the data relative to a target we would like to infer, adjusting the data to
correct for the bias by producing weights for each unit in the sample based on
propensity scores, and evaluating the final biases and the variance inflation
after applying the fitted weights. The package provides a simple API that can
be used by researchers and data scientists from a wide range of fields on a
variety of data. The paper provides the relevant context, methodological
background, and presents the package's API.
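A rough sketch of the three-step workflow using the package's Sample API is shown below; the file and column names are placeholders, not from the paper:
```python
# Illustrative sketch of the balance workflow; file/column names are assumptions.
import pandas as pd
from balance import Sample

sample_df = pd.read_csv("survey_sample.csv")       # biased, self-selected sample
target_df = pd.read_csv("target_population.csv")   # frame for the population of interest

sample = Sample.from_frame(sample_df, id_column="id")
target = Sample.from_frame(target_df, id_column="id")
sample_with_target = sample.set_target(target)

# Step 1: understand the initial bias relative to the target
# (absolute standardized mean differences of the covariates).
print(sample_with_target.covars().asmd())

# Step 2: adjust -- produce a weight per unit from propensity scores
# (inverse propensity weighting is the package's default method).
adjusted = sample_with_target.adjust(method="ipw")

# Step 3: evaluate the remaining bias and the variance inflation of the weights.
print(adjusted.covars().asmd())
print(adjusted.weights().design_effect())
```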
Related papers
- Bias Begins with Data: The FairGround Corpus for Robust and Reproducible Research on Algorithmic Fairness [42.93319580186729]
Machine learning (ML) systems are increasingly adopted in high-stakes decision-making domains. At the core of fair ML research are the datasets used to investigate bias and develop mitigation strategies. We present FairGround: a unified framework, data corpus, and Python package aimed at advancing reproducible research.
arXiv Detail & Related papers (2025-10-25T16:48:33Z)
- DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.
Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.
Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
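A loose illustration of the predict-instead-of-retrain idea (not the authors' implementation): encode each subset as a binary inclusion vector and fit an off-the-shelf GP on the subsets whose utility was already evaluated:
```python
# Sketch: GP regression over binary subset-inclusion vectors; data are synthetic.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
n_points = 20                                           # size of the full dataset
evaluated = rng.integers(0, 2, size=(30, n_points))     # 30 subsets already evaluated
utilities = evaluated.mean(axis=1) + rng.normal(0, 0.01, 30)  # stand-in utility values

gp = GaussianProcessRegressor(kernel=RBF(length_scale=2.0), normalize_y=True)
gp.fit(evaluated, utilities)

# Predict utility (with uncertainty) for unseen subsets instead of retraining.
new_subsets = rng.integers(0, 2, size=(5, n_points))
pred_mean, pred_std = gp.predict(new_subsets, return_std=True)
```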
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
- Revisiting the Dataset Bias Problem from a Statistical Perspective [72.94990819287551]
We study the "dataset bias" problem from a statistical standpoint.
We identify the main cause of the problem as the strong correlation between a class attribute u and a non-class attribute b.
We propose to mitigate dataset bias via either weighting the objective of each sample n by 1/p(u_n|b_n) or sampling that sample with a weight proportional to 1/p(u_n|b_n).
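A minimal sketch of this reweighting, assuming p(u|b) is estimated by simple conditional frequencies (the data are toy values):
```python
# Sketch: weight each sample by 1 / p(u_n | b_n), estimated from frequencies.
import pandas as pd

df = pd.DataFrame({
    "u": [0, 0, 0, 1, 1, 0, 1, 0],   # class attribute
    "b": [0, 0, 0, 0, 1, 1, 1, 1],   # correlated non-class attribute
})

# p(u | b): conditional frequency of each class value within each value of b.
p_u_given_b = df.groupby("b")["u"].value_counts(normalize=True)
weights = 1.0 / df.apply(lambda r: p_u_given_b[(r["b"], r["u"])], axis=1)

# The weights can multiply each sample's loss term, or define a sampling
# distribution (probability proportional to weight) for drawing batches.
print(weights)
```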
arXiv Detail & Related papers (2024-02-05T22:58:06Z)
- Utilizing dataset affinity prediction in object detection to assess training data [4.508868068781057]
We show the benefits of the so-called dataset affinity score by automatically selecting samples from a heterogeneous pool of vehicle datasets.
The results show that object detectors can be trained on a significantly sparser set of training samples without losing detection accuracy.
arXiv Detail & Related papers (2023-11-16T10:45:32Z)
- IBADR: an Iterative Bias-Aware Dataset Refinement Framework for Debiasing NLU models [52.03761198830643]
We propose IBADR, an Iterative Bias-Aware Dataset Refinement framework.
We first train a shallow model to quantify the bias degree of samples in the pool.
Then, we pair each sample with a bias indicator representing its bias degree, and use these extended samples to train a sample generator.
In this way, the generator can effectively learn the correspondence between bias indicators and samples.
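A hedged sketch of the first step only, assuming a bag-of-words logistic regression stands in for the shallow model and taking the bias degree as the probability it assigns to the gold label:
```python
# Sketch: quantify per-sample "bias degree" with a shallow surface-feature model.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great movie", "terrible plot", "great acting", "terrible pacing"]
labels = np.array([1, 0, 1, 0])

X = CountVectorizer().fit_transform(texts)
shallow = LogisticRegression().fit(X, labels)

# Bias degree: how predictable the gold label is from surface features alone.
bias_degree = shallow.predict_proba(X)[np.arange(len(labels)), labels]
# Each (sample, bias_degree) pair would then condition the sample generator.
```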
arXiv Detail & Related papers (2023-11-01T04:50:38Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
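One way to picture such a reweighting (my own illustration, not the paper's optimization) is to choose sample weights that shrink the weighted correlation between a single spurious feature and the label while staying close to uniform:
```python
# Sketch: optimize sample weights to cancel one spurious feature-label correlation.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 200
feature = rng.integers(0, 2, n).astype(float)            # spurious lexical feature
label = (feature + (rng.random(n) < 0.3)).clip(0, 1)     # label correlated with it

def objective(w):
    w = w / w.sum()
    cov = np.sum(w * (feature - np.sum(w * feature)) * (label - np.sum(w * label)))
    return cov ** 2 + 1e-3 * np.sum((w - 1.0 / n) ** 2)  # regularize toward uniform

res = minimize(objective, x0=np.full(n, 1.0 / n), bounds=[(1e-8, 1.0)] * n)
weights = res.x / res.x.sum()   # near-zero weighted covariance after optimization
```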
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Quantifying Human Bias and Knowledge to guide ML models during Training [0.0]
We introduce an experimental approach to dealing with skewed datasets by including humans in the training process.
We ask humans to rank the importance of features of the dataset, and through rank aggregation, determine the initial weight bias for the model.
We show that collective human bias can allow ML models to learn insights about the true population instead of the biased sample.
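A toy sketch of the rank-aggregation step, assuming Borda count as the aggregation rule (the paper may use a different rule):
```python
# Sketch: aggregate human importance rankings into initial feature weights.
features = ["age", "income", "region"]
# Each row: one person's ranking of the features, most important first.
rankings = [
    ["income", "age", "region"],
    ["age", "income", "region"],
    ["income", "region", "age"],
]

n = len(features)
borda = {f: 0 for f in features}
for ranking in rankings:
    for position, f in enumerate(ranking):
        borda[f] += n - 1 - position          # higher rank earns more points

total = sum(borda.values())
initial_weights = {f: score / total for f, score in borda.items()}
print(initial_weights)   # e.g. used as the initial weight bias for the model
```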
arXiv Detail & Related papers (2022-11-19T20:49:07Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- A Novel Dataset for Evaluating and Alleviating Domain Shift for Human Detection in Agricultural Fields [59.035813796601055]
We evaluate the impact of domain shift on human detection models trained on well known object detection datasets when deployed on data outside the distribution of the training set.
We introduce the OpenDR Humans in Field dataset, collected in the context of agricultural robotics applications, using the Robotti platform.
arXiv Detail & Related papers (2022-09-27T07:04:28Z)
- Mitigating Dataset Bias by Using Per-sample Gradient [9.290757451344673]
We propose PGD (Per-sample Gradient-based Debiasing), which comprises three steps: training a model on uniform batch sampling, setting the importance of each sample in proportion to the norm of the sample gradient, and training the model using importance-batch sampling.
Compared with existing baselines for various synthetic and real-world datasets, the proposed method showed state-of-the-art accuracy for the classification task.
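A minimal PyTorch sketch of the three steps (illustrative, not the authors' code):
```python
# Sketch of PGD's pipeline on a toy linear classifier.
import torch

X = torch.randn(100, 8)
y = torch.randint(0, 2, (100,))
model = torch.nn.Linear(8, 2)          # assume step 1 trained it with uniform batches
loss_fn = torch.nn.CrossEntropyLoss()

# Step 2: importance of sample i = norm of the gradient of its individual loss.
scores = torch.empty(len(X))
for i in range(len(X)):
    model.zero_grad()
    loss_fn(model(X[i:i + 1]), y[i:i + 1]).backward()
    scores[i] = torch.cat([p.grad.flatten() for p in model.parameters()]).norm()

# Step 3: retrain with importance-batch sampling.
probs = scores / scores.sum()
idx = torch.multinomial(probs, num_samples=32, replacement=True)
batch_X, batch_y = X[idx], y[idx]      # feed into the usual training loop
```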
arXiv Detail & Related papers (2022-05-31T11:41:02Z)
- Sampling Bias Correction for Supervised Machine Learning: A Bayesian Inference Approach with Practical Applications [0.0]
We discuss a problem where a dataset might be subject to intentional sample bias, such as label imbalance, and derive a Bayesian inference approach to correcting for it.
We then apply this solution to binary logistic regression and discuss scenarios where such bias arises in practice.
This technique is widely applicable for statistical inference on big data, from the medical sciences to image recognition to marketing.
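As one concrete instance: when positives are deliberately over-sampled for binary logistic regression, a classical prior-correction adjustment shifts the fitted intercept back toward the population prevalence. This is a standard technique shown for illustration, not necessarily the paper's exact derivation:
```python
# Sketch: intercept correction for intentional label imbalance (prior correction).
import numpy as np

sample_prevalence = 0.50       # positives deliberately over-sampled to 50%
population_prevalence = 0.05   # assumed known true prevalence

def corrected_intercept(b0: float) -> float:
    shift = np.log((sample_prevalence / (1 - sample_prevalence))
                   / (population_prevalence / (1 - population_prevalence)))
    return b0 - shift          # slope coefficients are unaffected by this sampling

print(corrected_intercept(0.0))
```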
arXiv Detail & Related papers (2022-03-11T20:46:37Z)
- Sampling To Improve Predictions For Underrepresented Observations In Imbalanced Data [0.0]
Data imbalance negatively impacts the predictive performance of models on underrepresented observations.
We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data.
We apply our methods to a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production.
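A minimal sketch of the general idea, assuming plain resampling with replacement of the underrepresented observations (toy data; the paper's sampling schemes may differ):
```python
# Sketch: upsample rare-regime observations before fitting a model.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({"yield": [0.9, 0.8, 0.85, 0.2], "rare_regime": [0, 0, 0, 1]})
minority = df[df["rare_regime"] == 1]
majority = df[df["rare_regime"] == 0]

upsampled = resample(minority, n_samples=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, upsampled])   # train on this balanced frame
```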
arXiv Detail & Related papers (2021-11-17T12:16:54Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
However, undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
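A simplified sketch of test-sample weighting, assuming the goal is to match the empirical distribution of one nuisance attribute to a reference distribution (the paper's method is more general):
```python
# Sketch: weight test samples so a nuisance attribute matches a reference mix.
import pandas as pd

test = pd.DataFrame({"group": ["a", "a", "a", "b"], "correct": [1, 1, 0, 1]})
reference = {"a": 0.5, "b": 0.5}     # desired attribute distribution

observed = test["group"].value_counts(normalize=True)
test["w"] = test["group"].map(lambda g: reference[g] / observed[g])

weighted_accuracy = (test["w"] * test["correct"]).sum() / test["w"].sum()
print(weighted_accuracy)             # bias metric computed under the reference mix
```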
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.