Rethnicity: Predicting Ethnicity from Names
- URL: http://arxiv.org/abs/2109.09228v1
- Date: Sun, 19 Sep 2021 21:30:22 GMT
- Title: Rethnicity: Predicting Ethnicity from Names
- Authors: Fangzhou Xie
- Abstract summary: I use the Bidirectional LSTM as the model and Florida Voter Registration as training data.
Special care is given to accuracy for minority groups by adjusting for the imbalance in the dataset.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: I provide an R package, rethnicity, for predicting ethnicity from
names. I use the Bidirectional LSTM as the model and Florida Voter Registration
as training data. Special care is given to accuracy for minority groups by
adjusting for the imbalance in the dataset. I also compare its availability,
accuracy, and performance with those of other solutions for predicting ethnicity from
names. A sample code snippet and an analysis of the DIME dataset are also shown as
applications of the package.
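The abstract says the class imbalance in the training data is adjusted but does not say how. As a minimal sketch of one common approach, random undersampling to the size of the smallest class, the Python snippet below may help illustrate the idea. The function name `undersample`, the placeholder names, and the group labels are all hypothetical and are not taken from the paper or the package.

```python
import random
from collections import defaultdict


def undersample(records, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    `records` is a list of (name, label) pairs. This is an assumed,
    illustrative balancing scheme, not the paper's documented procedure.
    """
    by_label = defaultdict(list)
    for name, label in records:
        by_label[label].append((name, label))

    # Every class is cut down to the size of the rarest class.
    smallest = min(len(rows) for rows in by_label.values())

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


# Made-up toy data: three groups with 8, 3, and 5 records.
records = (
    [("name_a%d" % i, "group_a") for i in range(8)]
    + [("name_b%d" % i, "group_b") for i in range(3)]
    + [("name_c%d" % i, "group_c") for i in range(5)]
)
balanced = undersample(records)
```

Undersampling discards data from the larger classes; oversampling the smaller classes or weighting the loss are alternatives with different trade-offs.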
Related papers
- Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve on accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- Can We Trust Race Prediction? [0.0]
I train a Bidirectional Long Short-Term Memory (BiLSTM) model on a novel dataset of voter registration data from all 50 US states.
I construct the most comprehensive database of first and surname distributions in the US.
I provide the first high-quality benchmark dataset in order to fairly compare existing models and aid future model developers.
arXiv Detail & Related papers (2023-07-17T13:59:07Z)
- Estimating Racial Disparities When Race is Not Observed [3.0931877196387196]
We introduce a new class of models that produce racial disparity estimates by using surnames as an instrumental variable for race.
A validation study based on the North Carolina voter file shows that BISG+BIRDiE reduces error by up to 84% when estimating racial differences in party registration.
We apply the proposed methodology to estimate racial differences in who benefits from the home mortgage interest deduction using individual-level tax data from the U.S. Internal Revenue Service.
arXiv Detail & Related papers (2023-03-05T04:46:16Z)
- Data Selection for Language Models via Importance Resampling [90.9263039747723]
We formalize the problem of selecting a subset of a large raw unlabeled dataset to match a desired target distribution.
We extend the classic importance resampling approach used in low dimensions to LM data selection.
We instantiate the DSIR framework with hashed n-gram features for efficiency, enabling the selection of 100M documents in 4.5 hours.
arXiv Detail & Related papers (2023-02-06T23:57:56Z)
- Race and ethnicity data for first, middle, and last names [0.0]
We provide the largest compiled publicly available dictionaries of first, middle, and last names for imputing race and ethnicity.
The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration.
arXiv Detail & Related papers (2022-08-26T05:27:50Z)
- Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements [0.0]
We introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts.
We supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available.
Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians.
arXiv Detail & Related papers (2022-05-12T14:41:45Z)
- Cross-Model Pseudo-Labeling for Semi-Supervised Action Recognition [98.25592165484737]
We propose a more effective pseudo-labeling scheme, called Cross-Model Pseudo-Labeling (CMPL).
CMPL achieves 17.6% and 25.1% Top-1 accuracy on Kinetics-400 and UCF-101, respectively, using only the RGB modality and 1% labeled data.
arXiv Detail & Related papers (2021-12-17T18:59:41Z)
- Balancing Biases and Preserving Privacy on Balanced Faces in the Wild [50.915684171879036]
There are demographic biases present in current facial recognition (FR) models.
We introduce our Balanced Faces in the Wild dataset to measure these biases across different ethnic and gender subgroups.
We find that relying on a single score threshold to differentiate between genuine and imposter sample pairs leads to suboptimal results.
We propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks.
arXiv Detail & Related papers (2021-03-16T15:05:49Z)
- The Gap on GAP: Tackling the Problem of Differing Data Distributions in Bias-Measuring Datasets [58.53269361115974]
Diagnostic datasets that can detect biased models are an important prerequisite for bias reduction within natural language processing.
Undesired patterns in the collected data can make such tests incorrect.
We introduce a theoretically grounded method for weighting test samples to cope with such patterns in the test data.
arXiv Detail & Related papers (2020-11-03T16:50:13Z)
- Predicting Race and Ethnicity From the Sequence of Characters in a Name [0.0]
We model the relationship between characters in a name and race and ethnicity using various techniques.
A model using Long Short-Term Memory works best with out-of-sample accuracy of .85.
The best-performing last-name model achieves out-of-sample accuracy of .81.
arXiv Detail & Related papers (2018-05-05T20:04:49Z)