Can We Trust Race Prediction?
- URL: http://arxiv.org/abs/2307.08496v2
- Date: Mon, 7 Aug 2023 20:27:19 GMT
- Title: Can We Trust Race Prediction?
- Authors: Cangyuan Li
- Abstract summary: I train a Bidirectional Long Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data from all 50 US states.
I construct the most comprehensive database of first and surname distributions in the US.
I provide the first high-quality benchmark dataset in order to fairly compare existing models and aid future model developers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the absence of sensitive race and ethnicity data, researchers, regulators,
and firms alike turn to proxies. In this paper, I train a Bidirectional Long
Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data
from all 50 US states and create an ensemble that achieves up to 36.8% higher
out-of-sample (OOS) F1 scores than the best-performing machine learning models
in the literature. Additionally, I construct the most comprehensive database of
first and surname distributions in the US in order to improve the coverage and
accuracy of Bayesian Improved Surname Geocoding (BISG) and Bayesian Improved
Firstname Surname Geocoding (BIFSG). Finally, I provide the first high-quality
benchmark dataset in order to fairly compare existing models and aid future
model developers.
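To make the BISG/BIFSG step concrete, here is a minimal sketch of the Bayesian update both methods rely on, written in Python. The probability tables, name strings, and tract ID below are hypothetical placeholders, not values from the paper's database.

```python
# Hedged sketch of the BISG / BIFSG posterior update (illustrative numbers only).
#   BISG:  P(race | surname, tract)        ~ P(race | surname) * P(tract | race)
#   BIFSG: P(race | first, surname, tract) ~ P(race | surname) * P(first | race) * P(tract | race)

RACES = ["white", "black", "hispanic", "api", "other"]

# Hypothetical lookup tables; a real pipeline would load these from Census
# surname/first-name files and tract-level population counts.
p_race_given_surname = {"garcia": [0.05, 0.01, 0.90, 0.02, 0.02]}
p_first_given_race = {"maria": [0.002, 0.001, 0.010, 0.003, 0.002]}
p_tract_given_race = {"06037203710": [0.0001, 0.0002, 0.0008, 0.0003, 0.0002]}


def bifsg_posterior(first: str, surname: str, tract: str) -> dict[str, float]:
    """Return P(race | first name, surname, tract) for one record."""
    prior = p_race_given_surname[surname.lower()]
    first_lik = p_first_given_race[first.lower()]
    geo_lik = p_tract_given_race[tract]
    unnorm = [p * f * g for p, f, g in zip(prior, first_lik, geo_lik)]
    total = sum(unnorm)
    return {race: u / total for race, u in zip(RACES, unnorm)}


print(bifsg_posterior("Maria", "Garcia", "06037203710"))
```

Dropping the first-name likelihood recovers plain BISG. Coverage matters because any name or tract missing from these tables forces a fallback to a coarser prior, which is the gap the expanded first- and surname database aims to close.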
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models [84.65095045762524]
We present three desiderata for a good benchmark for language models.
The resulting benchmarks reveal new trends in model rankings not shown by previous benchmarks.
We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering.
arXiv Detail & Related papers (2024-07-11T10:03:47Z)
- Prompt Public Large Language Models to Synthesize Data for Private On-device Applications [5.713077600587505]
This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for on-device language models trained with differential privacy (DP) and federated learning (FL).
The model pre-trained on our synthetic dataset achieves relative improvements of 19.0% and 22.8% in next-word prediction accuracy.
Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data.
arXiv Detail & Related papers (2024-04-05T19:14:14Z)
- CURATRON: Complete and Robust Preference Data for Rigorous Alignment of Large Language Models [1.6339731044538859]
This paper addresses the challenges of aligning large language models with human values via preference learning.
We propose a novel method for robustly handling incomplete and maliciously manipulated preference datasets in the alignment pipeline, enhancing LLMs' resilience.
arXiv Detail & Related papers (2024-03-05T07:58:12Z)
- Predicting the Geolocation of Tweets Using transformer models on Customized Data [17.55660062746406]
This research aims to solve the tweet/user geolocation prediction task.
The suggested approach uses neural networks for natural language processing to estimate the location.
The proposed models have been fine-tuned on a Twitter dataset.
arXiv Detail & Related papers (2023-03-14T12:56:47Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem and can in fact be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Benchmarking Bayesian Improved Surname Geocoding Against Machine Learning Methods [0.0]
BISG is the most popular method for proxying race/ethnicity in voter registration files that do not contain it.
This paper benchmarks BISG against a range of previously untested machine learning alternatives.
Results suggest that pre-trained machine learning models are preferable to BISG for individual classification.
arXiv Detail & Related papers (2022-06-26T11:12:37Z)
- Rethnicity: Predicting Ethnicity from Names [0.0]
I use a Bidirectional LSTM as the model and the Florida Voter Registration data as training data (a generic sketch of this kind of model appears after this entry).
Special care is given to the accuracy of minority groups by adjusting for the imbalance in the dataset.
arXiv Detail & Related papers (2021-09-19T21:30:22Z)
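Both the Rethnicity entry above and the main paper classify race/ethnicity directly from the characters of a name with a bidirectional LSTM. Below is a minimal PyTorch sketch of that kind of model; the vocabulary size, embedding and hidden dimensions, and five-way output are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn


class NameBiLSTM(nn.Module):
    """Generic character-level BiLSTM classifier: name -> race/ethnicity scores."""

    def __init__(self, n_chars: int = 60, emb_dim: int = 32,
                 hidden: int = 64, n_classes: int = 5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, max_name_len) integer-encoded, zero-padded characters
        x = self.embed(char_ids)
        _, (h_n, _) = self.lstm(x)               # h_n: (2, batch, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=1)   # final forward + backward states
        return self.fc(h)                        # unnormalized class scores


# Toy usage: a batch of two names, padded/encoded to length 15.
model = NameBiLSTM()
logits = model(torch.randint(1, 60, (2, 15)))   # shape (2, 5)
probs = logits.softmax(dim=1)
```

In a real setup, names would be lower-cased and integer-encoded per character, and the logits trained with cross-entropy against labels from the voter file; class imbalance can be handled with re-sampling or class weights, as the Rethnicity summary notes.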
- USACv20: robust essential, fundamental and homography matrix estimation [68.65610177368617]
We review the most recent RANSAC-like hypothesize-and-verify robust estimators.
The best performing ones are combined to create a state-of-the-art version of the Universal Sample Consensus (USAC) algorithm.
A proposed method, USACv20, is tested on eight publicly available real-world datasets.
arXiv Detail & Related papers (2021-04-11T16:27:02Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We achieve new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.