Race and ethnicity data for first, middle, and last names
- URL: http://arxiv.org/abs/2208.12443v1
- Date: Fri, 26 Aug 2022 05:27:50 GMT
- Title: Race and ethnicity data for first, middle, and last names
- Authors: Evan T. R. Rosenman, Santiago Olivella, and Kosuke Imai
- Abstract summary: We provide the largest compiled publicly available dictionaries of first, middle, and last names for imputing race and ethnicity.
The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We provide the largest compiled publicly available dictionaries of first,
middle, and last names for the purpose of imputing race and ethnicity using,
for example, Bayesian Improved Surname Geocoding (BISG). The dictionaries are
based on the voter files of six Southern states that collect self-reported
racial data upon voter registration. Our data cover a much larger scope of
names than any comparable dataset, containing roughly one million first names,
1.1 million middle names, and 1.4 million surnames. Individuals are categorized
into five mutually exclusive racial and ethnic groups -- White, Black,
Hispanic, Asian, and Other -- and racial/ethnic counts by name are provided for
every name in each dictionary. Counts can then be normalized row-wise or
column-wise to obtain conditional probabilities of race given name or name
given race. These conditional probabilities can then be deployed for imputation
in a data analytic task for which ground truth racial and ethnic data is not
available.
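The row-wise and column-wise normalizations described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the three names and all counts are invented for the example, and the function names `race_given_name` and `name_given_race` are hypothetical. Only the five group labels come from the abstract.

```python
from typing import Dict

# The five mutually exclusive groups named in the abstract.
RACES = ("white", "black", "hispanic", "asian", "other")

Table = Dict[str, Dict[str, float]]

# Hypothetical counts by name, invented for illustration only
# (these are NOT values from the published dictionaries).
counts: Table = {
    "smith": {"white": 70, "black": 25, "hispanic": 2, "asian": 1, "other": 2},
    "garcia": {"white": 5, "black": 1, "hispanic": 90, "asian": 2, "other": 2},
    "nguyen": {"white": 1, "black": 1, "hispanic": 1, "asian": 95, "other": 2},
}


def race_given_name(table: Table) -> Table:
    """Row-wise normalization: divide each count by its name's total."""
    result: Table = {}
    for name, row in table.items():
        total = sum(row[r] for r in RACES)
        result[name] = {r: (row[r] / total if total else 0.0) for r in RACES}
    return result


def name_given_race(table: Table) -> Table:
    """Column-wise normalization: divide each count by its group's total."""
    totals = {r: sum(row[r] for row in table.values()) for r in RACES}
    return {
        name: {r: (row[r] / totals[r] if totals[r] else 0.0) for r in RACES}
        for name, row in table.items()
    }


p_race_given_name = race_given_name(counts)
p_name_given_race = name_given_race(counts)
```

Each row of `p_race_given_name` sums to one across the five groups, and each column of `p_name_given_race` sums to one across the names in the table. The row-wise table is the quantity a BISG-style imputation would typically combine with geographic information.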
Related papers
- Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve on accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Estimating Racial Disparities When Race is Not Observed [3.0931877196387196]
We introduce a new class of models that produce racial disparity estimates by using surnames as an instrumental variable for race.
A validation study based on the North Carolina voter file shows that BISG+BIRDiE reduces error by up to 84% when estimating racial differences in party registration.
We apply the proposed methodology to estimate racial differences in who benefits from the home mortgage interest deduction using individual-level tax data from the U.S. Internal Revenue Service.
arXiv Detail & Related papers (2023-03-05T04:46:16Z)
- Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements [0.0]
We introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts.
We supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available.
Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians.
arXiv Detail & Related papers (2022-05-12T14:41:45Z)
- To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo [53.370023611101175]
We present a debiased dataset for the Person-centric Visual Grounding task first proposed by Cui et al.
Given an image and a caption, PCVG requires pairing up a person's name mentioned in a caption with a bounding box that points to the person in the image.
We find that the original Who's Waldo dataset contains a large number of biased samples that are solvable simply by heuristic methods.
arXiv Detail & Related papers (2022-03-30T21:35:53Z)
- Rethnicity: Predicting Ethnicity from Names [0.0]
I use a bidirectional LSTM as the model and Florida voter registration data for training.
Special care is given to accuracy for minority groups by adjusting for the imbalance in the dataset.
arXiv Detail & Related papers (2021-09-19T21:30:22Z)
- Avoiding bias when inferring race using name-based approaches [0.8543368663496084]
We use information from the U.S. Census and mortgage applications to infer the race of U.S.-affiliated authors in the Web of Science.
Our results demonstrate that the validity of name-based inference varies by race/ethnicity and that threshold approaches underestimate Black authors and overestimate White authors.
arXiv Detail & Related papers (2021-04-14T08:36:22Z)
- One Label, One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision [75.82110684355979]
We study the racial system encoded by computer vision datasets supplying categorical race labels for face images.
We find that each dataset encodes a substantially unique racial system, despite nominally equivalent racial categories.
We find evidence that racial categories encode stereotypes, and exclude ethnic groups from categories on the basis of nonconformity to stereotypes.
arXiv Detail & Related papers (2021-02-03T22:50:04Z)
- Text Classification Using Label Names Only: A Language Model Self-Training Approach [80.63885282358204]
Current text classification methods typically require a good number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
- Contrastive Examples for Addressing the Tyranny of the Majority [83.93825214500131]
We propose to create a balanced training dataset, consisting of the original dataset plus new data points in which the group memberships are intervened.
We show that current generative adversarial networks are a powerful tool for learning these data points, called contrastive examples.
arXiv Detail & Related papers (2020-04-14T14:06:44Z)
- Predicting Race and Ethnicity From the Sequence of Characters in a Name [0.0]
We model the relationship between characters in a name and race and ethnicity using various techniques.
A model using Long Short-Term Memory works best, with out-of-sample accuracy of 0.85.
The best-performing last-name model achieves out-of-sample accuracy of 0.81.
arXiv Detail & Related papers (2018-05-05T20:04:49Z)