Related papers: Agentic Username Suggestion and Multimodal Gender Detection in Online Platforms: Introducing the PNGT-26K Dataset

URL: http://arxiv.org/abs/2509.11136v1
Date: Sun, 14 Sep 2025 07:08:32 GMT
Title: Agentic Username Suggestion and Multimodal Gender Detection in Online Platforms: Introducing the PNGT-26K Dataset
Authors: Farbod Bijary, Mohsen Ebadpour, Amirhosein Tajbakhsh
Abstract summary: This paper introduces PNGT-26K, a comprehensive dataset of Persian names, their commonly associated gender, and their English transliteration, consisting of approximately 26,000 tuples. We also introduce two frameworks, namely Open Gender Detection and Nominalist. The PNGT-26K dataset, Nominalist and Open Gender Detection frameworks are publicly available on GitHub.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Persian names present unique challenges for natural language processing applications, particularly in gender detection and digital identity creation, due to transliteration inconsistencies and culture-specific naming patterns. Existing tools exhibit significant performance degradation on Persian names, while the scarcity of comprehensive datasets further compounds these limitations. To address these challenges, the present research introduces PNGT-26K, a comprehensive dataset of Persian names, their commonly associated gender, and their English transliteration, consisting of approximately 26,000 tuples. As a demonstration of how this resource can be utilized, we also introduce two frameworks, namely Open Gender Detection and Nominalist. Open Gender Detection is a production-grade, ready-to-use framework for using existing data from a user, such as profile photo and name, to give a probabilistic guess about the person's gender. Nominalist, the second framework introduced by this paper, utilizes agentic AI to help users choose a username for their social media accounts on any platform. It can be easily integrated into any website to provide a better user experience. The PNGT-26K dataset, Nominalist and Open Gender Detection frameworks are publicly available on GitHub.

Related papers

AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with more semantics. We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
Breaking the Silence Detecting and Mitigating Gendered Abuse in Hindi, Tamil, and Indian English Online Spaces [0.6543929004971272]
Team CNLP-NITS-PP developed an ensemble approach combining CNN and BiLSTM networks. The CNN captures localized features indicative of abusive language through convolution filters applied to the embedded input text. The BiLSTM analyzes this sequence for dependencies among words and phrases. Validation scores showed strong performance across F1 measures, especially for English (0.84).
arXiv Detail & Related papers (2024-04-02T14:55:47Z)
Gendec: A Machine Learning-based Framework for Gender Detection from Japanese Names [0.0]
This work presents a novel dataset for Japanese name gender detection comprising 64,139 full names in romaji, hiragana, and kanji forms, along with their biological genders. We propose Gendec, a framework for gender detection from Japanese names that leverages diverse approaches, including traditional machine learning techniques or cutting-edge transfer learning models.
arXiv Detail & Related papers (2023-11-18T07:46:59Z)
The Gender-GAP Pipeline: A Gender-Aware Polyglot Pipeline for Gender Characterisation in 55 Languages [51.2321117760104]
This paper describes the Gender-GAP Pipeline, an automatic pipeline to characterize gender representation in large-scale datasets for 55 languages. The pipeline uses a multilingual lexicon of gendered person-nouns to quantify the gender representation in text. We showcase it to report gender representation in WMT training data and development data for the News task, confirming that current data is skewed towards masculine representation.
arXiv Detail & Related papers (2023-08-31T17:20:50Z)
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution [80.57383975987676]
VisoGender is a novel dataset for benchmarking gender bias in vision-language models. We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas. We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes.
arXiv Detail & Related papers (2023-06-21T17:59:51Z)
For the Underrepresented in Gender Bias Research: Chinese Name Gender Prediction with Heterogeneous Graph Attention Network [1.13608321568471]
We design a Chinese Heterogeneous Graph Attention (CHGAT) model to capture the heterogeneity in component relationships and incorporate the pronunciations of characters. Our model largely surpasses current tools and also outperforms the state-of-the-art algorithm. We open-source a more balanced multi-character dataset from an official source together with our code, hoping to help future research promoting gender equality.
arXiv Detail & Related papers (2023-02-01T13:08:50Z)
MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG), which accesses an external non-parametric multimodal memory to augment language generation. Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
They, Them, Theirs: Rewriting with Gender-Neutral English [56.14842450974887]
We perform a case study on the singular they, a common way to promote gender inclusion in English. We show how a model can be trained to produce gender-neutral English with 1% word error rate with no human-labeled data.
arXiv Detail & Related papers (2021-02-12T21:47:48Z)
What's in a Name? -- Gender Classification of Names with Character Based Machine Learning Models [6.805167389805055]
We consider the problem of predicting the gender of registered users based on their declared name. By analyzing the first names of 100M+ users, we found that genders can be very effectively classified using the composition of the name strings.
arXiv Detail & Related papers (2021-02-07T01:01:32Z)
Gender Prediction Based on Vietnamese Names with Machine Learning Techniques [2.7528170226206443]
We propose a new dataset for gender prediction based on Vietnamese names. This dataset comprises over 26,000 full names annotated with genders. This paper describes six machine learning algorithms and a deep learning model (LSTM) with fastText word embedding for gender prediction on Vietnamese names.
arXiv Detail & Related papers (2020-10-21T09:25:48Z)
Mitigating Gender Bias in Captioning Systems [56.25457065032423]
Most captioning models learn gender bias, leading to high gender prediction errors, especially for women. We propose a new Guided Attention Image Captioning model (GAIC) which provides self-guidance on visual attention to encourage the model to capture correct gender visual evidence.
arXiv Detail & Related papers (2020-06-15T12:16:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.