Dataset Creation and Baseline Models for Sexism Detection in Hausa
- URL: http://arxiv.org/abs/2510.27038v1
- Date: Thu, 30 Oct 2025 22:57:35 GMT
- Title: Dataset Creation and Baseline Models for Sexism Detection in Hausa
- Authors: Fatima Adam Muhammad, Shamsuddeen Muhammad Hassan, Isa Inuwa-Dutse
- Abstract summary: This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study involving native speakers to explore how sexism is defined and articulated in everyday discourse. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sexism reinforces gender inequality and social exclusion by perpetuating stereotypes, bias, and discriminatory norms. Noting how online platforms enable various forms of sexism to thrive, there is a growing need for effective sexism detection and mitigation strategies. While computational approaches to sexism detection are widespread in high-resource languages, progress remains limited in low-resource languages, where scarce linguistic resources and cultural differences affect how sexism is expressed and perceived. This study introduces the first Hausa sexism detection dataset, developed through community engagement, qualitative coding, and data augmentation. For cultural nuances and linguistic representation, we conducted a two-stage user study (n=66) involving native speakers to explore how sexism is defined and articulated in everyday discourse. We further experiment with both traditional machine learning classifiers and pre-trained multilingual language models, and evaluate the effectiveness of few-shot learning in detecting sexism in Hausa. Our findings highlight challenges in capturing cultural nuance, particularly with clarification-seeking and idiomatic expressions, and reveal a tendency for many false positives in such cases.
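The abstract mentions traditional machine learning classifiers among the baselines. A minimal sketch of one such baseline, assuming a TF-IDF plus logistic regression setup (the exact classifiers and features used in the paper are not specified here); the example texts and labels below are hypothetical English placeholders, not items from the actual Hausa dataset:

```python
# Hedged sketch of a traditional ML baseline for binary sexism detection:
# TF-IDF character n-gram features with logistic regression.
# The training texts/labels are hypothetical placeholders for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: 1 = sexist, 0 = not sexist.
train_texts = [
    "women belong in the kitchen",
    "she gave a great talk at the conference",
    "girls cannot do math",
    "the weather is nice today",
]
train_labels = [1, 0, 1, 0]

# Character n-grams are a common choice for low-resource, morphologically
# rich languages, since they are more robust to spelling variation than
# word-level features.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

pred = model.predict(["girls cannot drive"])[0]
print(pred)
```

In a realistic setup the pipeline would be trained on the full annotated dataset and evaluated with precision/recall/F1, since false positives on idiomatic or clarification-seeking expressions are a key failure mode the paper reports.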
Related papers
- Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos [12.430871103040275]
We present FineMuSe, a new multimodal sexism detection dataset in Spanish. We also introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism.
arXiv Detail & Related papers (2026-02-17T17:45:28Z) - Demographic Biases and Gaps in the Perception of Sexism in Large Language Models [43.77504559722899]
We explore the capabilities of different Large Language Models to detect sexism in social media text. We analyze the demographic biases present in the models and conduct a statistical analysis. Our results show that, while LLMs can to some extent detect sexism when considering the overall opinion of populations, they do not accurately replicate the diversity of perceptions among different demographic groups.
arXiv Detail & Related papers (2025-08-25T17:36:58Z) - EuroGEST: Investigating gender stereotypes in multilingual language models [58.871032460235575]
We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. We show that the strongest stereotypes in all models across all languages are that women are 'beautiful', 'empathetic' and 'neat' and men are 'leaders', 'strong, tough' and 'professional'.
arXiv Detail & Related papers (2025-06-04T11:58:18Z) - MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos [12.555579923843641]
We introduce a new Multimodal Spanish dataset for Sexism Detection consisting of approximately 11 hours of videos extracted from TikTok and BitChute. We find that visual information plays a key role in labeling sexist content for both humans and models.
arXiv Detail & Related papers (2025-04-15T13:16:46Z) - A multitask learning framework for leveraging subjectivity of annotators to identify misogyny [47.175010006458436]
We propose a multitask learning approach to enhance the performance of the misogyny identification systems.
We incorporated diverse perspectives from annotators in our model design, considering gender and age across six profile groups.
This research advances content moderation and highlights the importance of embracing diverse perspectives to build effective online moderation systems.
arXiv Detail & Related papers (2024-06-22T15:06:08Z) - Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z) - SexWEs: Domain-Aware Word Embeddings via Cross-lingual Semantic Specialisation for Chinese Sexism Detection in Social Media [23.246615034191553]
We develop a cross-lingual domain-aware semantic specialisation system for sexism detection.
We leverage semantic resources for sexism from a high-resource language (English) to specialise pre-trained word vectors in the target language (Chinese) to inject domain knowledge.
Compared with other specialisation approaches and Chinese baseline word vectors, our SexWEs shows average score improvements of 0.033 and 0.064 in intrinsic and extrinsic evaluations, respectively.
arXiv Detail & Related papers (2022-11-15T19:00:20Z) - Analyzing Gender Representation in Multilingual Models [59.21915055702203]
We focus on the representation of gender distinctions as a practical case study.
We examine the extent to which the gender concept is encoded in shared subspaces across different languages.
arXiv Detail & Related papers (2022-04-20T00:13:01Z) - SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection [9.443571652110663]
We propose the first Chinese sexism dataset -- Sina Weibo Sexism Review (SWSR) dataset -- and a large Chinese lexicon SexHateLex.
SWSR dataset provides labels at different levels of granularity including (i) sexism or non-sexism, (ii) sexism category and (iii) target type.
We conduct experiments for the three sexism classification tasks making use of state-of-the-art machine learning models.
arXiv Detail & Related papers (2021-08-06T12:06:40Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Quantifying Gender Bias Towards Politicians in Cross-Lingual Language Models [104.41668491794974]
We quantify the usage of adjectives and verbs generated by language models surrounding the names of politicians as a function of their gender.
We find that while some words such as 'dead' and 'designated' are associated with both male and female politicians, a few specific words such as 'beautiful' and 'divorced' are predominantly associated with female politicians.
arXiv Detail & Related papers (2021-04-15T15:03:26Z) - "Call me sexist, but...": Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples [2.029924828197095]
We outline the different dimensions of sexism by grounding them in their implementation in psychological scales.
From the scales, we derive a codebook for sexism in social media, which we use to annotate existing and novel datasets.
Results indicate that current machine learning models pick up on a very narrow set of linguistic markers of sexism and do not generalize well to out-of-domain examples.
arXiv Detail & Related papers (2020-04-27T13:07:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.