SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection
- URL: http://arxiv.org/abs/2108.03070v1
- Date: Fri, 6 Aug 2021 12:06:40 GMT
- Authors: Aiqi Jiang, Xiaohan Yang, Yang Liu, Arkaitz Zubiaga
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online sexism has become an increasing concern in social media platforms as
it has affected the healthy development of the Internet and can have negative
effects on society. While research in the sexism detection domain is growing,
most of this research focuses on English as the language and on Twitter as the
platform. Our objective here is to broaden the scope of this research by
considering the Chinese language on Sina Weibo. We propose the first Chinese
sexism dataset -- the Sina Weibo Sexism Review (SWSR) dataset -- as well as a
large Chinese lexicon, SexHateLex, comprising abusive and gender-related terms. We
introduce our data collection and annotation process, and provide an
exploratory analysis of the dataset characteristics to validate its quality and
to show how sexism is manifested in Chinese. The SWSR dataset provides labels
at different levels of granularity including (i) sexism or non-sexism, (ii)
sexism category and (iii) target type, which can be exploited, among others,
for building computational methods to identify and investigate finer-grained
gender-related abusive language. We conduct experiments for the three sexism
classification tasks making use of state-of-the-art machine learning models.
Our results show competitive performance, providing a benchmark for sexism
detection in the Chinese language, as well as an error analysis highlighting
open challenges that call for further research in Chinese NLP. The SWSR dataset and
SexHateLex lexicon are publicly available.
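As a rough illustration of how a lexicon such as SexHateLex could support the coarse sexism/non-sexism task, a minimal keyword-matching baseline might look like the sketch below. The lexicon entries and example post are invented placeholders, not terms from the released resource; a real run would load the published SexHateLex term list instead.

```python
# Minimal lexicon-matching baseline for binary sexism flagging.
# NOTE: the terms below are invented placeholders, not real SexHateLex entries.

def load_lexicon(terms):
    """Deduplicate and normalise lexicon terms; longer terms sorted first."""
    return sorted({t.strip() for t in terms if t.strip()}, key=len, reverse=True)

def flag_post(text, lexicon):
    """Return the lexicon terms found in a post (empty list = not flagged)."""
    return [term for term in lexicon if term in text]

lexicon = load_lexicon(["placeholder_term_a", "placeholder_term_b"])
post = "an example post containing placeholder_term_a"
hits = flag_post(post, lexicon)
print(bool(hits), hits)
```

Such a baseline captures only surface-level matches; the paper's experiments with state-of-the-art machine learning models are precisely motivated by the finer-grained category and target labels that keyword matching cannot recover.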
Related papers
- GenderCARE: A Comprehensive Framework for Assessing and Reducing Gender Bias in Large Language Models [73.23743278545321]
Large language models (LLMs) have exhibited remarkable capabilities in natural language generation, but have also been observed to magnify societal biases.
GenderCARE is a comprehensive framework that encompasses innovative Criteria, bias Assessment, Reduction techniques, and Evaluation metrics.
arXiv Detail & Related papers (2024-08-22T15:35:46Z)
- Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words [85.48043537327258]
Existing machine translation gender bias evaluations are primarily focused on male and female genders.
This study presents a benchmark, AmbGIMT (Gender-Inclusive Machine Translation with Ambiguous attitude words).
We propose a novel process to evaluate gender bias based on the Emotional Attitude Score (EAS), which is used to quantify ambiguous attitude words.
arXiv Detail & Related papers (2024-07-23T08:13:51Z)
- A multitask learning framework for leveraging subjectivity of annotators to identify misogyny [47.175010006458436]
We propose a multitask learning approach to enhance the performance of the misogyny identification systems.
We incorporated diverse perspectives from annotators in our model design, considering gender and age across six profile groups.
This research advances content moderation and highlights the importance of embracing diverse perspectives to build effective online moderation systems.
arXiv Detail & Related papers (2024-06-22T15:06:08Z)
- Bilingual Sexism Classification: Fine-Tuned XLM-RoBERTa and GPT-3.5 Few-Shot Learning [0.8192907805418581]
This study aims to improve sexism identification in bilingual contexts (English and Spanish) by leveraging natural language processing models.
We fine-tuned the XLM-RoBERTa model and separately used GPT-3.5 with few-shot learning prompts to classify sexist content.
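As a hedged sketch of the few-shot setup described above, the prompt for a GPT-style classifier could be assembled as follows. The example texts, labels, and prompt wording are invented for illustration and are not the paper's actual template; the API call itself is omitted.

```python
# Build a bilingual few-shot prompt for binary sexism classification.
# The demonstration texts and the prompt wording are assumptions for
# illustration, not the template used in the cited paper.

FEW_SHOT = [
    ("Example flagged remark (EN)", "sexist"),
    ("Comentario neutral de ejemplo (ES)", "not sexist"),
]

def build_prompt(text, shots=FEW_SHOT):
    """Concatenate labelled demonstrations, then the unlabelled query."""
    lines = ["Classify each text as 'sexist' or 'not sexist'.", ""]
    for shot_text, label in shots:
        lines.append(f"Text: {shot_text}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)

print(build_prompt("A new comment to classify"))
```

The completion returned for the trailing "Label:" slot would then be parsed as the predicted class; the fine-tuned XLM-RoBERTa route instead trains a standard sequence-classification head on the labelled data.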
arXiv Detail & Related papers (2024-06-11T14:15:33Z)
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do.
We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models.
Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z)
- SemEval-2023 Task 10: Explainable Detection of Online Sexism [5.542286527528687]
We introduce SemEval Task 10 on the Explainable Detection of Online Sexism (EDOS)
We make three main contributions: i) a novel hierarchical taxonomy of sexist content, which includes granular vectors of sexism to aid explainability; ii) a new dataset of 20,000 social media comments with fine-grained labels, along with larger unlabelled datasets for model adaptation; and iii) baseline models as well as an analysis of the methods, results and errors for participant submissions to our task.
arXiv Detail & Related papers (2023-03-07T20:28:39Z)
- CORGI-PM: A Chinese Corpus For Gender Bias Probing and Mitigation [28.38578407487603]
We propose a Chinese cOrpus foR Gender bIas Probing and Mitigation (CORGI-PM), which contains 32.9k sentences with high-quality labels.
We address three challenges for automatic textual gender bias mitigation, which requires the models to detect, classify, and mitigate textual gender bias.
CORGI-PM is the first sentence-level Chinese corpus for gender bias probing and mitigation.
arXiv Detail & Related papers (2023-01-01T12:48:12Z)
- SexWEs: Domain-Aware Word Embeddings via Cross-lingual Semantic Specialisation for Chinese Sexism Detection in Social Media [23.246615034191553]
We develop a cross-lingual domain-aware semantic specialisation system for sexism detection.
We leverage semantic resources for sexism from a high-resource language (English) to specialise pre-trained word vectors in the target language (Chinese) to inject domain knowledge.
Compared with other specialisation approaches and Chinese baseline word vectors, our SexWEs shows an average score improvement of 0.033 and 0.064 in both intrinsic and extrinsic evaluations.
arXiv Detail & Related papers (2022-11-15T19:00:20Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study output offensiveness of popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Towards Understanding and Mitigating Social Biases in Language Models [107.82654101403264]
Large-scale pretrained language models (LMs) can be potentially dangerous in manifesting undesirable representational biases.
We propose steps towards mitigating social biases during text generation.
Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information.
arXiv Detail & Related papers (2021-06-24T17:52:43Z)
- "Call me sexist, but...": Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples [2.029924828197095]
We outline the different dimensions of sexism by grounding them in their implementation in psychological scales.
From the scales, we derive a codebook for sexism in social media, which we use to annotate existing and novel datasets.
Results indicate that current machine learning models pick up on a very narrow set of linguistic markers of sexism and do not generalize well to out-of-domain examples.
arXiv Detail & Related papers (2020-04-27T13:07:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.