Predicting Race and Ethnicity From the Sequence of Characters in a Name
- URL: http://arxiv.org/abs/1805.02109v2
- Date: Sat, 8 Jul 2023 01:41:11 GMT
- Authors: Rajashekar Chintalapati, Suriyan Laohaprapanon, and Gaurav Sood
- Abstract summary: We model the relationship between characters in a name and race and ethnicity using various techniques.
A model using Long Short-Term Memory works best with out-of-sample accuracy of .85.
The best-performing last-name model achieves out-of-sample accuracy of .81.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To answer questions about racial inequality and fairness, we often need a way
to infer race and ethnicity from names. One way to infer race and ethnicity
from names is by relying on the Census Bureau's list of popular last names. The
list, however, suffers from at least three limitations: 1. it only contains
last names, 2. it only includes popular last names, and 3. it is updated once
every 10 years. To provide better generalization, and higher accuracy when
first names are available, we model the relationship between characters in a
name and race and ethnicity using various techniques. A model using Long
Short-Term Memory works best with out-of-sample accuracy of .85. The
best-performing last-name model achieves out-of-sample accuracy of .81. To
illustrate the utility of the models, we apply them to campaign finance data to
estimate the share of donations made by people of various racial groups, and to
news data to estimate the coverage of various races and ethnicities in the
news.
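The abstract describes modeling the sequence of characters in a name. As a minimal sketch of what the input to such a sequence model could look like, the snippet below turns a name into a fixed-length list of integer indices. The use of character bigrams, the reserved `PAD` and `UNK` indices, the `max_len` default, and the helper names `bigrams`, `build_vocab`, and `encode` are all assumptions made for illustration; the abstract does not specify the encoding, and the example names are hypothetical.

```python
from typing import Dict, List

PAD = 0  # assumed index reserved for padding short names
UNK = 1  # assumed index for bigrams absent from the vocabulary


def bigrams(name: str) -> List[str]:
    """Split a name into overlapping two-character chunks (lowercased)."""
    s = name.strip().lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def build_vocab(names: List[str]) -> Dict[str, int]:
    """Assign each distinct bigram an integer index, starting at 2."""
    vocab: Dict[str, int] = {}
    for name in names:
        for bg in bigrams(name):
            if bg not in vocab:
                vocab[bg] = len(vocab) + 2
    return vocab


def encode(name: str, vocab: Dict[str, int], max_len: int = 20) -> List[int]:
    """Map a name to bigram indices, truncated or padded to max_len."""
    ids = [vocab.get(bg, UNK) for bg in bigrams(name)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


if __name__ == "__main__":
    # Hypothetical training names, used only to build a toy vocabulary.
    vocab = build_vocab(["smith", "sood"])
    print(encode("Smith", vocab, max_len=6))
    print(encode("soto", vocab, max_len=4))
```

Because unseen bigrams fall back to `UNK` rather than failing, a name that never appeared in the training list still gets an encoding, which is the kind of generalization beyond a fixed list of popular last names that the abstract motivates. A sequence model such as an LSTM would consume these index lists through an embedding layer.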
Related papers
- Uncovering Name-Based Biases in Large Language Models Through Simulated Trust Game [0.0]
Gender and race inferred from an individual's name are a notable source of stereotypes and biases that subtly influence social interactions.
We show that our approach can detect name-based biases in both base and instruction-tuned models.
arXiv Detail & Related papers (2024-04-23T02:21:17Z)
- What's in a Name? Auditing Large Language Models for Race and Gender Bias [49.28899492966893]
We employ an audit design to investigate biases in state-of-the-art large language models, including GPT-4.
We find that the advice systematically disadvantages names that are commonly associated with racial minorities and women.
arXiv Detail & Related papers (2024-02-21T18:25:25Z)
- Multicultural Name Recognition For Previously Unseen Names [65.268245109828]
This paper attempts to improve recognition of person names, a diverse category that can grow any time someone is born or changes their name.
I look at names from 103 countries to compare how well the model performs on names from different cultures.
I find that a model with combined character and word input outperforms word-only models and may improve on accuracy compared to classical NER models.
arXiv Detail & Related papers (2024-01-23T17:58:38Z)
- Probabilistic Test-Time Generalization by Variational Neighbor-Labeling [62.158807685159736]
This paper strives for domain generalization, where models are trained exclusively on source domains before being deployed on unseen target domains.
It proposes probabilistic pseudo-labeling of target samples to generalize the source-trained model to the target domain at test time.
It also introduces variational neighbor labels that incorporate the information of neighboring target samples to generate more robust pseudo labels.
arXiv Detail & Related papers (2023-07-08T18:58:08Z)
- Race and ethnicity data for first, middle, and last names [0.0]
We provide the largest compiled publicly available dictionaries of first, middle, and last names for imputing race and ethnicity.
The dictionaries are based on the voter files of six Southern states that collect self-reported racial data upon voter registration.
arXiv Detail & Related papers (2022-08-26T05:27:50Z)
- Addressing Census data problems in race imputation via fully Bayesian Improved Surname Geocoding and name supplements [0.0]
We introduce a fully Bayesian Improved Surname Geocoding (fBISG) methodology that accounts for potential measurement error in Census counts.
We supplement the Census surname data with additional data on last, first, and middle names taken from the voter files of six Southern states where self-reported race is available.
Our empirical validation shows that the fBISG methodology and name supplements significantly improve the accuracy of race imputation across all racial groups, and especially for Asians.
arXiv Detail & Related papers (2022-05-12T14:41:45Z)
- raceBERT -- A Transformer-based Model for Predicting Race and Ethnicity from Names [0.0]
raceBERT is a transformer-based model for predicting race and ethnicity from character sequences in names.
It achieves state-of-the-art results in race prediction using names, with an average f1-score of 0.86 -- a 4.1% improvement over the previous state-of-the-art, and improvements between 15-17% for non-white names.
arXiv Detail & Related papers (2021-12-07T16:30:40Z)
- Rethnicity: Predicting Ethnicity from Names [0.0]
I use a bidirectional LSTM as the model and Florida voter registration data as training data.
Special care is given to accuracy on minority groups by adjusting for the imbalance in the dataset.
arXiv Detail & Related papers (2021-09-19T21:30:22Z)
- Text Classification Using Label Names Only: A Language Model Self-Training Approach [80.63885282358204]
Current text classification methods typically require a good number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
- Investigating Cross-Linguistic Adjective Ordering Tendencies with a Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z)
- A Brief Survey and Comparative Study of Recent Development of Pronoun Coreference Resolution [55.39835612617972]
Pronoun Coreference Resolution (PCR) is the task of resolving pronominal expressions to all mentions they refer to.
As one important natural language understanding (NLU) component, pronoun resolution is crucial for many downstream tasks and still challenging for existing models.
We conduct extensive experiments to show that even though current models are achieving good performance on the standard evaluation set, they are still not ready to be used in real applications.
arXiv Detail & Related papers (2020-09-27T01:40:01Z)