Gender Prediction Based on Vietnamese Names with Machine Learning
Techniques
- URL: http://arxiv.org/abs/2010.10852v4
- Date: Tue, 23 Mar 2021 07:25:00 GMT
- Title: Gender Prediction Based on Vietnamese Names with Machine Learning
Techniques
- Authors: Huy Quoc To, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen, Anh Gia-Tuan
Nguyen
- Abstract summary: We propose a new dataset for gender prediction based on Vietnamese names.
This dataset comprises over 26,000 full names annotated with genders.
This paper describes six machine learning algorithms and a deep learning model (LSTM) with fastText word embedding for gender prediction on Vietnamese names.
- Score: 2.7528170226206443
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Because biological gender is one aspect of how an individual is presented,
much work has been done on gender classification based on people's names. Many
approaches have been proposed for English and Chinese, but few for Vietnamese
so far. We propose a new dataset for gender prediction based on Vietnamese
names. This dataset comprises over 26,000 full names annotated with genders and
is available on our website for research purposes. In addition, this paper
describes six machine learning algorithms (Support Vector Machine, Multinomial
Naive Bayes, Bernoulli Naive Bayes, Decision Tree, Random Forest and Logistic
Regression) and a deep learning model (LSTM) with fastText word embeddings for
gender prediction on Vietnamese names. We also investigate the impact of each
name component on detecting gender. The best F1-score we achieve reaches 96%,
with the LSTM model, and we provide a web API based on our trained model.
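The abstract names Multinomial Naive Bayes as one of the classical baselines and mentions studying the impact of each name component. The sketch below illustrates that idea in plain Python; it is not the authors' implementation. The training names are a made-up toy sample, the helper names (`name_features`, `NameGenderNB`, `f1_score`) are hypothetical, and the split into family, middle, and given name by token position is an assumption about how components might be defined.

```python
import math
import unicodedata
from collections import Counter, defaultdict

ALL_COMPONENTS = ("family", "middle", "given")


def name_features(full_name, components=ALL_COMPONENTS):
    """Turn a full name into position-tagged tokens.

    Assumption: the first token is the family name, the last token is the
    given name, and everything in between is the middle name.
    """
    text = unicodedata.normalize("NFC", full_name).lower()
    tokens = text.split()
    parts = {
        "family": tokens[:1],
        "middle": tokens[1:-1],
        "given": tokens[-1:] if len(tokens) > 1 else [],
    }
    return [f"{c}={t}" for c in components for t in parts[c]]


class NameGenderNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0, components=ALL_COMPONENTS):
        self.alpha = alpha
        self.components = components

    def fit(self, names, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for name, label in zip(names, labels):
            feats = name_features(name, self.components)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        feats = [f for f in name_features(name, self.components) if f in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.class_counts[label] / total)  # log prior
            for f in feats:  # tokens unseen in training are skipped
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy training sample, invented for illustration only (not the paper's dataset).
TRAIN = [
    ("Nguyễn Văn Nam", "male"),
    ("Trần Văn Hùng", "male"),
    ("Lê Đức Minh", "male"),
    ("Phạm Thị Hoa", "female"),
    ("Nguyễn Thị Lan", "female"),
    ("Trần Ngọc Mai", "female"),
]
names, labels = zip(*TRAIN)

full_model = NameGenderNB().fit(names, labels)
given_only = NameGenderNB(components=("given",)).fit(names, labels)

print(full_model.predict("Vũ Thị Minh"))   # female
print(given_only.predict("Vũ Thị Minh"))   # male
```

On this toy sample the two models disagree on the same name, which shows in miniature why comparing name components can matter: the middle name carries the signal that the given name alone lacks. Results on real data would of course depend on the dataset and features actually used.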
Related papers
- For the Misgendered Chinese in Gender Bias Research: Multi-Task Learning with Knowledge Distillation for Pinyin Name-Gender Prediction [8.287754685560815]
We formulate the Pinyin name-gender guessing problem and design a Multi-Task Learning Network assisted by Knowledge Distillation.
Our open-sourced method surpasses commercial name-gender guessing tools by 9.70% to 20.08% relatively, and also outperforms the state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-10T03:16:07Z)
- Gender inference: can chatGPT outperform common commercial tools? [0.0]
We compare the performance of a generative Artificial Intelligence (AI) tool ChatGPT with three commercially available list-based and machine learning-based gender inference tools.
Specifically, we use a large Olympic athlete dataset and report how variations in the input (e.g., first name and first and last name) impact the accuracy of their predictions.
ChatGPT performs at least as well as Namsor and often outperforms it, especially for the female sample when country and/or last name information is available.
arXiv Detail & Related papers (2023-11-24T22:09:14Z)
- Gendec: A Machine Learning-based Framework for Gender Detection from Japanese Names [0.0]
This work presents a novel dataset for Japanese name gender detection comprising 64,139 full names in romaji, hiragana, and kanji forms, along with their biological genders.
We propose Gendec, a framework for gender detection from Japanese names that leverages diverse approaches, including traditional machine learning techniques or cutting-edge transfer learning models.
arXiv Detail & Related papers (2023-11-18T07:46:59Z)
- Will the Prince Get True Love's Kiss? On the Model Sensitivity to Gender Perturbation over Fairytale Texts [87.62403265382734]
Recent studies show that traditional fairytales are rife with harmful gender biases.
This work aims to assess learned biases of language models by evaluating their robustness against gender perturbations.
arXiv Detail & Related papers (2023-10-16T22:25:09Z)
- Towards Understanding Gender-Seniority Compound Bias in Natural Language Generation [64.65911758042914]
We investigate how seniority impacts the degree of gender bias exhibited in pretrained neural generation models.
Our results show that GPT-2 amplifies bias by considering women as junior and men as senior more often than the ground truth in both domains.
These results suggest that NLP applications built using GPT-2 may harm women in professional capacities.
arXiv Detail & Related papers (2022-05-19T20:05:02Z)
- Predicting gender of Brazilian names using deep learning [0.0]
Some machine learning algorithms can satisfactorily perform the prediction.
A dataset of Brazilian names is used to train and evaluate the models.
Some models accurately predict the gender in more than 90% of the cases.
arXiv Detail & Related papers (2021-06-18T14:45:59Z)
- Quantifying Gender Bias Towards Politicians in Cross-Lingual Language Models [104.41668491794974]
We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender.
We find that while some words such as dead and designated are associated with both male and female politicians, a few specific words such as beautiful and divorced are predominantly associated with female politicians.
arXiv Detail & Related papers (2021-04-15T15:03:26Z)
- They, Them, Theirs: Rewriting with Gender-Neutral English [56.14842450974887]
We perform a case study on the singular they, a common way to promote gender inclusion in English.
We show how a model can be trained to produce gender-neutral English with 1% word error rate with no human-labeled data.
arXiv Detail & Related papers (2021-02-12T21:47:48Z)
- What's in a Name? -- Gender Classification of Names with Character Based Machine Learning Models [6.805167389805055]
We consider the problem of predicting the gender of registered users based on their declared name.
By analyzing the first names of 100M+ users, we found that genders can be very effectively classified using the composition of the name strings.
arXiv Detail & Related papers (2021-02-07T01:01:32Z)
- Mitigating Gender Bias in Captioning Systems [56.25457065032423]
Most captioning models learn gender bias, leading to high gender prediction errors, especially for women.
We propose a new Guided Attention Image Captioning model (GAIC) which provides self-guidance on visual attention to encourage the model to capture correct gender visual evidence.
arXiv Detail & Related papers (2020-06-15T12:16:19Z)
- Multi-Dimensional Gender Bias Classification [67.65551687580552]
Machine learning models can inadvertently learn socially undesirable patterns when training on gender biased text.
We propose a general framework that decomposes gender bias in text along several pragmatic and semantic dimensions.
Using this fine-grained framework, we automatically annotate eight large scale datasets with gender information.
arXiv Detail & Related papers (2020-05-01T21:23:20Z)