Benchmarking Bayesian Improved Surname Geocoding Against Machine
Learning Methods
- URL: http://arxiv.org/abs/2206.14583v1
- Date: Sun, 26 Jun 2022 11:12:37 GMT
- Title: Benchmarking Bayesian Improved Surname Geocoding Against Machine
Learning Methods
- Authors: Ari Decter-Frain
- Abstract summary: BISG is the most popular method for proxying race/ethnicity in voter registration files that do not contain it.
This paper benchmarks BISG against a range of previously untested machine learning alternatives.
Results suggest that pre-trained machine learning models are preferable to BISG for individual classification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bayesian Improved Surname Geocoding (BISG) is the most popular method for
proxying race/ethnicity in voter registration files that do not contain it.
This paper benchmarks BISG against a range of previously untested machine
learning alternatives, using voter files with self-reported race/ethnicity from
California, Florida, North Carolina, and Georgia. This analysis yields three
key findings. First, when given the exact same inputs, BISG and machine
learning perform similarly for estimating aggregate racial/ethnic composition.
Second, machine learning outperforms BISG at individual classification of
race/ethnicity. Third, the performance of all methods varies substantially
across states. These results suggest that pre-trained machine learning models
are preferable to BISG for individual classification. Furthermore, mixed
results at the precinct level and across states underscore the need for
researchers to empirically validate their chosen race/ethnicity proxy in their
populations of interest.
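For readers unfamiliar with the method, the following is a minimal sketch of the Bayes update usually described as the core of BISG, not the paper's implementation. It assumes the commonly stated form, in which surname and geography are treated as conditionally independent given race/ethnicity, so that P(race | surname, geo) is proportional to P(race | surname) * P(geo | race). The function names, category labels, and all probability values are hypothetical and chosen only for illustration. The sketch also shows the two uses the abstract distinguishes: individual classification (taking the most probable category) and aggregate composition (averaging posterior probabilities across people).

```python
from typing import Dict, List


def bisg_posterior(
    p_race_given_surname: Dict[str, float],
    p_geo_given_race: Dict[str, float],
) -> Dict[str, float]:
    """Combine a surname-based prior with a geography likelihood.

    Assumes surname and geography are conditionally independent given
    race/ethnicity, so the posterior is proportional to their product.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("All categories have zero probability.")
    return {race: value / total for race, value in unnormalized.items()}


def classify(posterior: Dict[str, float]) -> str:
    """Individual classification: pick the most probable category."""
    return max(posterior, key=posterior.get)


def aggregate_composition(posteriors: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate estimate: average posterior probabilities over people."""
    n = len(posteriors)
    return {race: sum(p[race] for p in posteriors) / n for race in posteriors[0]}


# Hypothetical inputs for one voter (illustrative numbers, not real data).
p_race_given_surname = {"white": 0.6, "black": 0.3, "hispanic": 0.1}
p_geo_given_race = {"white": 0.001, "black": 0.004, "hispanic": 0.002}

posterior = bisg_posterior(p_race_given_surname, p_geo_given_race)
predicted = classify(posterior)

# A second hypothetical voter whose posterior is used as-is.
other = {"white": 0.7, "black": 0.2, "hispanic": 0.1}
composition = aggregate_composition([posterior, other])
```

In this toy case the geography likelihood shifts the surname-based prior enough to change the most probable category, which is why a method can do reasonably well on aggregate composition while still misclassifying individuals.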
Related papers
- A robust three-way classifier with shadowed granular-balls based on justifiable granularity [53.39844791923145]
We construct a robust three-way classifier with shadowed GBs for uncertain data.
Our model demonstrates robustness in managing uncertain data and effectively mitigates classification risks.
arXiv Detail & Related papers (2024-07-03T08:54:45Z)
- JointMatch: A Unified Approach for Diverse and Collaborative Pseudo-Labeling to Semi-Supervised Text Classification [65.268245109828]
Semi-supervised text classification (SSTC) has gained increasing attention due to its ability to leverage unlabeled data.
Existing approaches based on pseudo-labeling suffer from the issues of pseudo-label bias and error accumulation.
We propose JointMatch, a holistic approach for SSTC that addresses these challenges by unifying ideas from recent semi-supervised learning.
arXiv Detail & Related papers (2023-10-23T05:43:35Z)
- Can We Trust Race Prediction? [0.0]
I train a Bidirectional Long Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data from all 50 US states.
I construct the most comprehensive database of first and surname distributions in the US.
I provide the first high-quality benchmark dataset in order to fairly compare existing models and aid future model developers.
arXiv Detail & Related papers (2023-07-17T13:59:07Z)
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
arXiv Detail & Related papers (2023-05-04T23:11:31Z)
- Change is Hard: A Closer Look at Subpopulation Shift [48.0369745740936]
We propose a unified framework that dissects and explains common shifts in subgroups.
We then establish a benchmark of 20 state-of-the-art algorithms evaluated on 12 real-world datasets in vision, language, and healthcare domains.
arXiv Detail & Related papers (2023-02-23T18:59:56Z)
- Parametric Classification for Generalized Category Discovery: A Baseline Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
arXiv Detail & Related papers (2022-11-21T18:47:11Z)
- No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
- Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [65.92826041406802]
We propose a Proxy-based deep Graph Metric Learning approach from the perspective of graph classification.
Multiple global proxies are leveraged to collectively approximate the original data points for each class.
We design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels.
arXiv Detail & Related papers (2020-10-26T14:52:42Z)
- K-Nearest Neighbour and Support Vector Machine Hybrid Classification [0.0]
The technique consists of using K-Nearest Neighbour Classification for test samples satisfying a proximity condition.
For every separated test sample, a Support Vector Machine is trained on the sifted training set patterns associated with it, and classification for the test sample is done.
arXiv Detail & Related papers (2020-06-28T15:26:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.