Common Misconceptions about Population Data
- URL: http://arxiv.org/abs/2112.10912v1
- Date: Mon, 20 Dec 2021 23:54:49 GMT
- Title: Common Misconceptions about Population Data
- Authors: Peter Christen and Rainer Schnell
- Abstract summary: This article discusses a diverse range of misconceptions about population data that we believe anybody who works with such data needs to be aware of.
The massive size of such databases is often mistaken as a guarantee for valid inferences on the population of interest.
We conclude with a set of recommendations for inference when using population data.
- Score: 5.606904856295946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Databases covering all individuals of a population are increasingly used for
research studies in domains ranging from public health to the social sciences.
There is also growing interest among governments and businesses in using population
data to support data-driven decision making. The massive size of such databases
is often mistaken as a guarantee for valid inferences on the population of
interest. However, population data have characteristics that make them
challenging to use, including various assumptions made about how such data were
collected and what types of processing have been applied to them. Furthermore,
the full potential of population data can often only be unlocked when such data
are linked to other databases, a process that adds fresh challenges. This
article discusses a diverse range of misconceptions about population data that
we believe anybody who works with such data needs to be aware of. Many of these
misconceptions are not well documented in scientific publications but only
discussed anecdotally among researchers and practitioners. We conclude with a
set of recommendations for inference when using population data.
Related papers
- Differentially Private Data Release on Graphs: Inefficiencies and Unfairness [48.96399034594329]
This paper characterizes the impact of Differential Privacy on bias and unfairness in the context of releasing information about networks.
We consider a network release problem where the network structure is known to all, but the weights on edges must be released privately.
Our work provides theoretical foundations and empirical evidence into the bias and unfairness arising due to privacy in these networked decision problems.
arXiv Detail & Related papers (2024-08-08T08:37:37Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- BiasBuster: a Neural Approach for Accurate Estimation of Population Statistics using Biased Location Data [6.077198822448429]
We show that statistical debiasing, although in some cases useful, often fails to improve accuracy.
We then propose BiasBuster, a neural network approach that utilizes the correlations between population statistics and location characteristics to provide accurate estimates of population statistics.
arXiv Detail & Related papers (2024-02-17T16:16:24Z)
- Mean Estimation with User-level Privacy under Data Heterogeneity [54.07947274508013]
Different users may possess vastly different numbers of data points.
It cannot be assumed that all users sample from the same underlying distribution.
We propose a simple model of heterogeneous user data that allows user data to differ in both distribution and quantity of data.
arXiv Detail & Related papers (2023-07-28T23:02:39Z)
- Releasing survey microdata with exact cluster locations and additional privacy safeguards [77.34726150561087]
We propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards.
Our strategy reduces the respondents' re-identification risk for any number of disclosed attributes by 60-80% even under re-identification attempts.
arXiv Detail & Related papers (2022-05-24T19:37:11Z)
- Pseudo-PFLOW: Development of nationwide synthetic open dataset for people movement based on limited travel survey and open statistical data [4.243926243206826]
People flow data are utilized in diverse fields such as urban and commercial planning and disaster management.
This study developed pseudo-people-flow data covering all of Japan by combining public statistical and travel survey data.
arXiv Detail & Related papers (2022-05-02T05:13:53Z)
- So2Sat POP -- A Curated Benchmark Data Set for Population Estimation from Space on a Continental Scale [11.38584315242023]
We provide a comprehensive data set for population estimation in 98 European cities.
The data set comprises a digital elevation model, local climate zone, land use proportions, nighttime lights in combination with multi-spectral Sentinel-2 imagery, and data from the Open Street Map initiative.
arXiv Detail & Related papers (2022-04-07T07:30:43Z)
- Assessing the Quality of Gridded Population Data for Quantifying the Population Living in Deprived Communities [0.0]
In 2014 on average 65% of the urban population lived in slums.
Most of the data about slums comes from census data, which is only available at aggregate levels and often excludes these settlements.
We evaluate the accuracy of the WorldPOP and LandScan population layers against ground-truth data composed of 1,703 georeferenced polygons.
arXiv Detail & Related papers (2020-11-25T18:14:30Z)
- Leveraging Administrative Data for Bias Audits: Assessing Disparate Coverage with Mobility Data for COVID-19 Policy [61.60099467888073]
We show how linking administrative data can enable auditing mobility data for bias.
We show that older and non-white voters are less likely to be captured by mobility data.
We show that allocating public health resources based on such mobility data could disproportionately harm high-risk elderly and minority groups.
arXiv Detail & Related papers (2020-11-14T02:04:14Z)
- Bayesian Quantile Matching Estimation [4.56877715768796]
Research and scientific understanding, e.g. for medical diagnostics or policy advice, often rely on data access.
We propose a Bayesian method for learning from quantile information.
arXiv Detail & Related papers (2020-08-14T15:39:51Z)
- Measuring Social Biases of Crowd Workers using Counterfactual Queries [84.10721065676913]
Social biases based on gender, race, etc. have been shown to pollute machine learning (ML) pipelines, predominantly via biased training datasets.
Crowdsourcing, a popular cost-effective measure to gather labeled training datasets, is not immune to the inherent social biases of crowd workers.
We propose a new method based on counterfactual fairness to quantify the degree of inherent social bias in each crowd worker.
arXiv Detail & Related papers (2020-04-04T21:41:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.