Small Area Estimation with Random Forests and the LASSO
- URL: http://arxiv.org/abs/2308.15180v1
- Date: Tue, 29 Aug 2023 10:02:10 GMT
- Title: Small Area Estimation with Random Forests and the LASSO
- Authors: Victoire Michal, Jon Wakefield, Alexandra M. Schmidt, Alicia
Cavanaugh, Brian Robinson and Jill Baumgartner
- Abstract summary: This work is motivated by Ghanaian data available from the sixth Living Standard Survey (GLSS) and the 2010 Population and Housing Census.
We compare areal-level random forests and LASSO approaches to a frequentist forward variable selection approach and a Bayesian shrinkage method.
We find substantial between-area variation, the log consumption areal point estimates showing a 1.3-fold variation across the GAMA region.
- Score: 39.58317527488534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider random forests and LASSO methods for model-based small area
estimation when the number of areas with sampled data is a small fraction of
the total areas for which estimates are required. Abundant auxiliary
information is available for the sampled areas, from the survey, and for all
areas, from an exterior source, and the goal is to use auxiliary variables to
predict the outcome of interest. We compare areal-level random forests and
LASSO approaches to a frequentist forward variable selection approach and a
Bayesian shrinkage method. Further, to measure the uncertainty of estimates
obtained from random forests and the LASSO, we propose a modification of the
split conformal procedure that relaxes the assumption of identically
distributed data. This work is motivated by Ghanaian data available from the
sixth Living Standard Survey (GLSS) and the 2010 Population and Housing Census.
We estimate the areal mean household log consumption using both datasets. The
outcome variable is measured only in the GLSS for 3% of all the areas (136 out
of 5019) and more than 170 potential covariates are available from both
datasets. Among the four modelling methods considered, the Bayesian shrinkage
performed the best in terms of bias, MSE and prediction interval coverages and
scores, as assessed through a cross-validation study. We find substantial
between-area variation, the log consumption areal point estimates showing a
1.3-fold variation across the GAMA region. The western areas are the poorest
while the Accra Metropolitan Area district gathers the richest areas.
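The abstract mentions a modification of the split conformal procedure but does not describe it. As background only, the following is a minimal sketch of the standard split conformal procedure, which assumes exchangeable data; it is not the authors' modified version. The one-covariate least-squares fit is a stand-in for a random forest or LASSO predictor, and the names `fit_line` and `split_conformal`, the synthetic data, and the choice `alpha=0.1` are all illustrative assumptions.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares with one covariate; returns (intercept, slope).

    Stand-in for any point predictor (e.g. a random forest or the LASSO).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def split_conformal(xs, ys, x_new, alpha=0.1, seed=0):
    """Standard split conformal intervals (assumes exchangeable data)."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, calib = idx[:half], idx[half:]

    # Step 1: fit the predictor on the training half only.
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])

    # Step 2: absolute residuals on the held-out calibration half.
    scores = sorted(abs(ys[i] - (a + b * xs[i])) for i in calib)

    # Step 3: take the ceil((n + 1)(1 - alpha))-th smallest residual.
    # If that rank exceeds n, the interval is the whole real line.
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    q = scores[k - 1] if k <= n else math.inf

    # Step 4: symmetric interval of half-width q around each prediction.
    return [(a + b * x - q, a + b * x + q) for x in x_new]


# Synthetic illustration (not the GLSS or census data).
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]
intervals = split_conformal(xs, ys, [2.0, 8.0], alpha=0.1)
for lower, upper in intervals:
    print(f"[{lower:.2f}, {upper:.2f}]")
```

Every interval from this standard procedure has the same width, because a single calibration quantile is used for all new points. That is one reason it can be a poor fit when areas are not identically distributed, which is the assumption the abstract says the authors relax.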
Related papers
- Less is More: Fewer Interpretable Region via Submodular Subset Selection [54.07758302264416]
This paper re-models the above image attribution problem as a submodular subset selection problem.
We construct a novel submodular function to discover more accurate small interpretation regions.
For correctly predicted samples, the proposed method improves the Deletion and Insertion scores with an average of 4.9% and 2.5% gain relative to HSIC-Attribution.
arXiv Detail & Related papers (2024-02-14T13:30:02Z)
- A step towards the integration of machine learning and small area estimation [0.0]
We propose a predictor supported by machine learning algorithms which can be used to predict any population or subpopulation characteristics.
We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well.
In addition, we propose a method for estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods.
arXiv Detail & Related papers (2024-02-12T09:43:17Z)
- Numerically assisted determination of local models in network scenarios [55.2480439325792]
We develop a numerical tool for finding explicit local models that reproduce a given statistical behaviour.
We provide conjectures for the critical visibilities of the Greenberger-Horne-Zeilinger (GHZ) and W distributions.
The developed codes and documentation are publicly available at281.com/mariofilho/localmodels.
arXiv Detail & Related papers (2023-03-17T13:24:04Z)
- Habitat classification from satellite observations with sparse annotations [4.164845768197488]
We propose a method for habitat classification and mapping using remote sensing data.
The method is characterized by using finely-grained, sparse, single-pixel annotations collected from the field.
We show that cropping augmentations, test-time augmentation and semi-supervised learning can help classification even further.
arXiv Detail & Related papers (2022-09-26T20:14:59Z)
- Country-wide Retrieval of Forest Structure From Optical and SAR Satellite Imagery With Bayesian Deep Learning [74.94436509364554]
We propose a Bayesian deep learning approach to densely estimate forest structure variables at country-scale with 10-meter resolution.
Our method jointly transforms Sentinel-2 optical images and Sentinel-1 synthetic aperture radar images into maps of five different forest structure variables.
We train and test our model on reference data from 41 airborne laser scanning missions across Norway.
arXiv Detail & Related papers (2021-11-25T16:21:28Z)
- A windowed correlation based feature selection method to improve time series prediction of dengue fever cases [0.20072624123275526]
Prediction performance can be poor in places with inadequate data.
A new framework is presented for windowing incidence data and computing time-shifted correlation-based metrics.
Recurrent neural network-based prediction models achieve up to 33.6% accuracy improvement on average.
arXiv Detail & Related papers (2021-04-21T00:28:28Z)
- Balancing Biases and Preserving Privacy on Balanced Faces in the Wild [50.915684171879036]
There are demographic biases present in current facial recognition (FR) models.
We introduce our Balanced Faces in the Wild dataset to measure these biases across different ethnic and gender subgroups.
We find that relying on a single score threshold to differentiate between genuine and impostor sample pairs leads to suboptimal results.
We propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks.
arXiv Detail & Related papers (2021-03-16T15:05:49Z)
- Magnify Your Population: Statistical Downscaling to Augment the Spatial Resolution of Socioeconomic Census Data [48.7576911714538]
We present a new statistical downscaling approach to derive fine-scale estimates of key socioeconomic attributes.
For each selected socioeconomic variable, a Random Forest model is trained on the source Census units and then used to generate fine-scale gridded predictions.
As a case study, we apply this method to Census data in the United States, downscaling the selected socioeconomic variables available at the block group level, to a grid of 300 spatial resolution.
arXiv Detail & Related papers (2020-06-23T16:52:18Z)
- Towards Adaptive Benthic Habitat Mapping [9.904746542801838]
We show how a habitat model can be used to plan efficient Autonomous Underwater Vehicles (AUVs) surveys.
A Bayesian neural network is used to predict visually-derived habitat classes when given broad-scale bathymetric data.
We demonstrate how these structured uncertainty estimates can be utilised to improve the model with fewer samples.
arXiv Detail & Related papers (2020-06-20T01:03:41Z)