Least Squares Estimation For Hierarchical Data
- URL: http://arxiv.org/abs/2404.13164v2
- Date: Wed, 04 Jun 2025 12:13:43 GMT
- Title: Least Squares Estimation For Hierarchical Data
- Authors: Ryan Cumings-Menon, Pavel Zhuravlev
- Abstract summary: The U.S. Census Bureau's 2020 Disclosure Avoidance System (DAS) bases its output on noisy measurements. This paper describes an algorithm that can leverage a hierarchical structure of the input data in order to compute very high dimensional least squares estimates.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The U.S. Census Bureau's 2020 Disclosure Avoidance System (DAS) bases its output on noisy measurements, which are population tabulations added to realizations of mean-zero random variables. These noisy measurements are observed for a set of hierarchical geographic units, e.g., the U.S. as a whole, states, counties, census tracts, and census blocks. The Census Bureau released the noisy measurements generated in the DAS executions for the two primary 2020 Census data products, in part to allow data users to assess uncertainty in 2020 Census tabulations introduced by disclosure avoidance. This paper describes an algorithm that can leverage a hierarchical structure of the input data in order to compute very high dimensional least squares estimates in a computationally efficient manner. Afterward, we show that this algorithm's output is equal to the generalized least squares estimator, describe how to find the variance of linear functions of this estimator, and provide a numerical experiment in which we compute confidence intervals of 2010 Census tabulations based on this estimator. We also describe an accompanying Census Bureau experimental data product that applies this estimator to the publicly available noisy measurements to provide data users with the inputs required to estimate confidence intervals for all tabulations that were included in one of the two main 2020 Census data products, i.e., the 2020 Redistricting Data Product, at the US, state, county, and census tract geographic levels.
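The generalized least squares estimator described in the abstract can be illustrated with a small numerical sketch. The example below assumes a toy two-level hierarchy (one parent total measured alongside its two children) and illustrative noise variances; it is a minimal sketch of GLS estimation and interval construction, not the DAS implementation:

```python
import numpy as np

# Toy two-level hierarchy: a parent geography whose count is the sum of
# two child geographies. Unknowns: the two child counts x = (x1, x2).
# Noisy measurements: one for the parent total and one per child.
A = np.array([[1.0, 1.0],   # parent measurement observes x1 + x2
              [1.0, 0.0],   # child 1 measurement
              [0.0, 1.0]])  # child 2 measurement
sigma2 = np.array([4.0, 1.0, 1.0])   # per-measurement noise variances
Sigma_inv = np.diag(1.0 / sigma2)

rng = np.random.default_rng(0)
x_true = np.array([100.0, 50.0])
y = A @ x_true + rng.normal(0.0, np.sqrt(sigma2))  # noisy measurements

# Generalized least squares: x_hat = (A' S^-1 A)^-1 A' S^-1 y
M = np.linalg.inv(A.T @ Sigma_inv @ A)
x_hat = M @ A.T @ Sigma_inv @ y

# Variance of a linear function c' x_hat is c' M c; e.g. the parent total:
c = np.array([1.0, 1.0])
est = c @ x_hat
se = np.sqrt(c @ M @ c)
ci = (est - 1.96 * se, est + 1.96 * se)  # approximate 95% interval
```

Because the parent measurement is less precise here (variance 4) than the child measurements (variance 1 each), the GLS estimate of the parent total has variance 4/3, smaller than either the parent measurement alone (4) or the sum of the child measurements (2).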
Related papers
- Benchmarking Fraud Detectors on Private Graph Data [70.4654745317714]
Currently, many types of fraud are managed in part by automated detection algorithms that operate over graphs. We consider the scenario where a data holder wishes to outsource development of fraud detectors to third parties. Third parties submit their fraud detectors to the data holder, who evaluates these algorithms on a private dataset and then publicly communicates the results. We propose a realistic privacy attack on this system that allows an adversary to de-anonymize individuals' data based only on the evaluation results.
arXiv Detail & Related papers (2025-07-30T03:20:15Z) - The NetMob25 Dataset: A High-resolution Multi-layered View of Individual Mobility in Greater Paris Region [64.30214722988666]
This paper describes the survey design, collection protocol, processing methodology, and characteristics of the released dataset. The dataset includes three components: (i) an Individuals database describing demographic, socioeconomic, and household characteristics; (ii) a Trips database with over 80,000 annotated displacements including timestamps, transport modes, and trip purposes; and (iii) a Raw GPS Traces database comprising about 500 million high-frequency points.
arXiv Detail & Related papers (2025-06-06T09:22:21Z) - PHSafe: Disclosure Avoidance for the 2020 Census Supplemental Demographic and Housing Characteristics File (S-DHC) [7.7544849165583525]
The article describes the PHSafe algorithm, which is based on adding noise drawn from a discrete Gaussian distribution to the statistics of interest. We prove that the algorithm satisfies a well-studied variant of differential privacy, called zero-concentrated differential privacy.
arXiv Detail & Related papers (2025-05-02T13:20:32Z) - SafeTab-P: Disclosure Avoidance for the 2020 Census Detailed Demographic and Housing Characteristics File A (Detailed DHC-A) [7.787555954397617]
The article describes the disclosure avoidance algorithm that the U.S. Census Bureau used to protect the Detailed Demographic and Housing Characteristics File A (DHC-A) of the 2020 Census. The SafeTab-P algorithm is based on adding noise drawn from a discrete Gaussian distribution to the statistics of interest. We prove that the algorithm satisfies a well-studied variant of differential privacy, called zero-concentrated differential privacy (zCDP).
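The PHSafe and SafeTab-P entries above rest on the same primitive: add discrete Gaussian noise to a statistic and account for the privacy loss under zCDP. A minimal sketch of that primitive follows; the truncated-support sampler is for illustration only (production mechanisms use exact samplers), and all parameter values are hypothetical:

```python
import numpy as np

def discrete_gaussian(sigma, rng, tail=10):
    # Truncated-support approximation to the discrete Gaussian on the
    # integers: probabilities proportional to exp(-k^2 / (2 sigma^2)).
    support = np.arange(-int(tail * sigma) - 1, int(tail * sigma) + 2)
    logp = -support.astype(float) ** 2 / (2.0 * sigma**2)
    p = np.exp(logp - logp.max())
    return rng.choice(support, p=p / p.sum())

def rho_zcdp(delta_sens, sigma):
    # Gaussian-mechanism accounting: noise of variance sigma^2 added to a
    # statistic with L2 sensitivity Delta satisfies rho-zCDP with
    # rho = Delta^2 / (2 sigma^2).
    return delta_sens**2 / (2.0 * sigma**2)

rng = np.random.default_rng(0)
true_count = 42
noisy_count = true_count + discrete_gaussian(sigma=5.0, rng=rng)
rho = rho_zcdp(1.0, 5.0)  # counting query, Delta = 1, sigma = 5
```

For a counting query with sensitivity 1 and sigma = 5, this accounting yields rho = 0.02; the noisy count remains an integer, which matters for census-style tabulations.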
arXiv Detail & Related papers (2025-05-02T13:08:28Z) - Scalable Private Partition Selection via Adaptive Weighting [66.09199304818928]
In a private set union, users hold subsets of items from an unbounded universe.
The goal is to output as many items as possible from the union of the users' sets while maintaining user-level differential privacy.
We propose an algorithm for this problem, MaximumDegree (MAD), which adaptively reroutes weight from items with weight far above the threshold needed for privacy to items with smaller weight.
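For context, the non-adaptive baseline that MAD improves upon can be sketched in a few lines: each user splits unit weight uniformly across their items, and an item is released when its noisy weight clears a threshold. This is a toy sketch with hypothetical parameters; the adaptive rerouting step that defines MAD is omitted.

```python
import numpy as np
from collections import defaultdict

def private_union(user_sets, sigma=1.0, threshold=3.0, seed=0):
    # Baseline private partition selection: uniform per-user weighting plus
    # a noisy threshold. MAD additionally reroutes weight from items far
    # above the threshold to each user's lighter items (not shown here).
    rng = np.random.default_rng(seed)
    weight = defaultdict(float)
    for items in user_sets:
        for item in items:
            weight[item] += 1.0 / len(items)  # each user spends total weight 1
    return {item for item, w in weight.items()
            if w + rng.normal(0.0, sigma) > threshold}

users = [{"a", "b"}, {"a"}, {"a", "c"}, {"a"}, {"a", "b"}]
released = private_union(users, sigma=0.5, threshold=2.0)
```

Here "a" accumulates weight 3.5 and will almost surely clear the threshold, while "c" (weight 0.5) will almost surely be suppressed; capping each user's total contribution at 1 is what bounds individual influence.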
arXiv Detail & Related papers (2025-02-13T01:27:11Z) - Best Linear Unbiased Estimate from Privatized Histograms [6.17477133700348]
In differential privacy (DP) mechanisms, it can be beneficial to release "redundant" outputs.
We show that the minimum variance processing is a linear projection.
We propose the Scalable Algorithm Efficient for Best Linear Unbiased Estimate (SEA BLUE).
arXiv Detail & Related papers (2024-09-06T16:27:34Z) - Noisy Measurements Are Important, the Design of Census Products Is Much More Important [1.52292571922932]
McCartan et al. (2023) call for "making differential privacy work for census data users".
This commentary explains why the 2020 Census Noisy Measurement Files (NMFs) are not the best focus for that plea.
arXiv Detail & Related papers (2023-12-20T15:43:04Z) - Disclosure Avoidance for the 2020 Census Demographic and Housing Characteristics File [7.664548801662584]
We describe the concepts and methods used by the Disclosure Avoidance System (DAS) to produce formally private output in support of the 2020 Census data product releases.
We describe the updates to the DAS that were required to release the Demographic and Housing Characteristics (DHC) File.
We also describe subsequent experimental data products to facilitate development of tools that provide confidence intervals for confidential 2020 Census tabulations.
arXiv Detail & Related papers (2023-12-18T00:54:04Z) - Robust Statistical Comparison of Random Variables with Locally Varying Scale of Measurement [0.562479170374811]
Spaces with locally varying scale of measurement, like multidimensional structures with differently scaled dimensions, are common in statistics and machine learning.
We address this problem by considering an order based on (sets of) expectations of random variables mapping into such non-standard spaces.
This order contains dominance and expectation order as extreme cases when no, or respectively perfect, cardinal structure is given.
arXiv Detail & Related papers (2023-06-22T11:02:18Z) - A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset.
We introduce a novel piecewise power law (PPL) that handles the small-data and large-data regimes differently.
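The single-regime baseline that the piecewise variant generalizes is straightforward to sketch: fit err(n) proportional to n to the power -b in log-log space. The data below are synthetic and the parameter values illustrative, not drawn from the paper:

```python
import numpy as np

# Power-law extrapolation of model error vs. training-set size:
# err(n) ~ a * n**(-b), fit as a line in log-log space.
n = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
err = 2.0 * n ** -0.35                 # synthetic measurements on the law

slope, log_a = np.polyfit(np.log(n), np.log(err), 1)
a, b = np.exp(log_a), -slope           # recovered parameters
pred_err_100k = a * 100_000.0 ** -b    # extrapolated error at n = 100,000
```

A piecewise variant would fit separate (a, b) pairs below and above a breakpoint in n, since small-data behavior often deviates from the asymptotic power law.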
arXiv Detail & Related papers (2023-03-02T21:48:22Z) - Concrete Score Matching: Generalized Score Matching for Discrete Data [109.12439278055213]
"Concrete score" is a generalization of the (Stein) score for discrete settings.
"Concrete Score Matching" is a framework to learn such scores from samples.
arXiv Detail & Related papers (2022-11-02T00:41:37Z) - Compact Redistricting Plans Have Many Spanning Trees [39.779544988993294]
In the design and analysis of political redistricting maps, it is often useful to be able to sample from the space of all partitions of the graph of census blocks into connected subgraphs of equal population.
In this paper, we establish an inverse exponential relationship between the total length of the boundaries separating districts and the probability that such a map will be sampled.
arXiv Detail & Related papers (2021-09-27T23:36:01Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [65.92826041406802]
We propose a Proxy-based deep Graph Metric Learning approach from the perspective of graph classification.
Multiple global proxies are leveraged to collectively approximate the original data points for each class.
We design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels.
arXiv Detail & Related papers (2020-10-26T14:52:42Z) - Distribution Matching for Crowd Counting [51.90971145453012]
We show that imposing Gaussians to annotations hurts generalization performance.
We propose to use Distribution Matching for crowd COUNTing (DM-Count)
In terms of Mean Absolute Error, DM-Count outperforms the previous state-of-the-art methods.
arXiv Detail & Related papers (2020-09-28T04:57:23Z) - Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z) - Magnify Your Population: Statistical Downscaling to Augment the Spatial Resolution of Socioeconomic Census Data [48.7576911714538]
We present a new statistical downscaling approach to derive fine-scale estimates of key socioeconomic attributes.
For each selected socioeconomic variable, a Random Forest model is trained on the source Census units and then used to generate fine-scale gridded predictions.
As a case study, we apply this method to Census data in the United States, downscaling the selected socioeconomic variables available at the block group level to a grid of 300 m spatial resolution.
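A schematic of the train-then-predict downscaling step is sketched below, using an ordinary least squares model as a stand-in for the paper's Random Forest and entirely toy data, plus a mass-preserving rescale so that fine-grid estimates sum to the observed coarse total:

```python
import numpy as np

rng = np.random.default_rng(0)

# Coarse units: observed socioeconomic totals plus covariates
# (e.g. land-use fractions). Fit a model on the coarse level.
X_coarse = rng.random((20, 3))
y_coarse = X_coarse @ np.array([5.0, 2.0, 1.0]) + rng.normal(0, 0.1, 20)
coef, *_ = np.linalg.lstsq(X_coarse, y_coarse, rcond=None)

# Fine grid cells nested in one coarse unit: predict per cell, then
# rescale so the cell estimates sum to the unit's observed total.
X_cells = rng.random((16, 3))
raw = X_cells @ coef
unit_total = 120.0
downscaled = raw * unit_total / raw.sum()
```

The rescale step keeps the downscaled field consistent with the published coarse tabulation, so the model only decides how the total is spatially distributed.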
arXiv Detail & Related papers (2020-06-23T16:52:18Z) - NWPU-Crowd: A Large-Scale Benchmark for Crowd Counting and Localization [101.13851473792334]
We construct a large-scale congested crowd counting and localization dataset, NWPU-Crowd, consisting of 5,109 images with a total of 2,133,375 heads annotated with points and boxes.
Compared with other real-world datasets, it contains various illumination scenes and has the largest density range (0–20,033).
We describe the data characteristics, evaluate the performance of some mainstream state-of-the-art (SOTA) methods, and analyze the new problems that arise on the new data.
arXiv Detail & Related papers (2020-01-10T09:26:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.