SexWEs: Domain-Aware Word Embeddings via Cross-lingual Semantic
Specialisation for Chinese Sexism Detection in Social Media
- URL: http://arxiv.org/abs/2211.08447v3
- Date: Thu, 30 Mar 2023 21:59:01 GMT
- Title: SexWEs: Domain-Aware Word Embeddings via Cross-lingual Semantic
Specialisation for Chinese Sexism Detection in Social Media
- Authors: Aiqi Jiang, Arkaitz Zubiaga
- Abstract summary: We develop a cross-lingual domain-aware semantic specialisation system for sexism detection.
We leverage semantic resources for sexism from a high-resource language (English) to specialise pre-trained word vectors in the target language (Chinese) to inject domain knowledge.
Compared with other specialisation approaches and Chinese baseline word vectors, our SexWEs shows an average score improvement of 0.033 and 0.064 in both intrinsic and extrinsic evaluations.
- Score: 23.246615034191553
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of sexism detection is to mitigate negative online content targeting
certain gender groups of people. However, the limited availability of labeled
sexism-related datasets makes it problematic to identify online sexism for
low-resource languages. In this paper, we address the task of automatic sexism
detection in social media for one low-resource language -- Chinese. Rather than
collecting new sexism data or building cross-lingual transfer learning models,
we develop a cross-lingual domain-aware semantic specialisation system in order
to make the most of existing data. Semantic specialisation is a technique for
retrofitting pre-trained distributional word vectors by integrating external
linguistic knowledge (such as lexico-semantic relations) into the specialised
feature space. To do this, we leverage semantic resources for sexism from a
high-resource language (English) to specialise pre-trained word vectors in the
target language (Chinese) to inject domain knowledge. We demonstrate the
benefit of our sexist word embeddings (SexWEs) specialised by our framework via
intrinsic evaluation of word similarity and extrinsic evaluation of sexism
detection. Compared with other specialisation approaches and Chinese baseline
word vectors, our SexWEs shows an average score improvement of 0.033 and 0.064
in both intrinsic and extrinsic evaluations, respectively. The ablative results
and visualisation of SexWEs also prove the effectiveness of our framework on
retrofitting word vectors in low-resource languages.
Related papers
- The Lou Dataset -- Exploring the Impact of Gender-Fair Language in German Text Classification [57.06913662622832]
Gender-fair language fosters inclusion by addressing all genders or using neutral forms.
Gender-fair language substantially impacts predictions by flipping labels, reducing certainty, and altering attention patterns.
While we offer initial insights on the effect on German text classification, the findings likely apply to other languages.
arXiv Detail & Related papers (2024-09-26T15:08:17Z) - Beyond Binary Gender: Evaluating Gender-Inclusive Machine Translation with Ambiguous Attitude Words [85.48043537327258]
Existing machine translation gender bias evaluations are primarily focused on male and female genders.
This study presents a benchmark AmbGIMT (Gender-Inclusive Machine Translation with Ambiguous attitude words)
We propose a novel process to evaluate gender bias based on the Emotional Attitude Score (EAS), which is used to quantify ambiguous attitude words.
arXiv Detail & Related papers (2024-07-23T08:13:51Z) - Target-Agnostic Gender-Aware Contrastive Learning for Mitigating Bias in
Multilingual Machine Translation [28.471506840241602]
Gender bias is a significant issue in machine translation, leading to ongoing research efforts in developing bias mitigation techniques.
We propose a bias mitigation method based on a novel approach.
Gender-Aware Contrastive Learning, GACL, encodes contextual gender information into the representations of non-explicit gender words.
arXiv Detail & Related papers (2023-05-23T12:53:39Z) - Analyzing Gender Representation in Multilingual Models [59.21915055702203]
We focus on the representation of gender distinctions as a practical case study.
We examine the extent to which the gender concept is encoded in shared subspaces across different languages.
arXiv Detail & Related papers (2022-04-20T00:13:01Z) - Under the Morphosyntactic Lens: A Multifaceted Evaluation of Gender Bias
in Speech Translation [20.39599469927542]
Gender bias is largely recognized as a problematic phenomenon affecting language technologies.
Most of current evaluation practices adopt a word-level focus on a narrow set of occupational nouns under synthetic conditions.
Such protocols overlook key features of grammatical gender languages, which are characterized by morphosyntactic chains of gender agreement.
arXiv Detail & Related papers (2022-03-18T11:14:16Z) - Gender Bias in Text: Labeled Datasets and Lexicons [0.30458514384586394]
There is a lack of gender bias datasets and lexicons for automating the detection of gender bias.
We provide labeled datasets and exhaustive lexicons by collecting, annotating, and augmenting relevant sentences.
The released datasets and lexicons span multiple bias subtypes including: Generic He, Generic She, Explicit Marking of Sex, and Gendered Neologisms.
arXiv Detail & Related papers (2022-01-21T12:44:51Z) - SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection [9.443571652110663]
We propose the first Chinese sexism dataset -- Sina Weibo Sexism Review (SWSR) dataset -- and a large Chinese lexicon SexHateLex.
SWSR dataset provides labels at different levels of granularity including (i) sexism or non-sexism, (ii) sexism category and (iii) target type.
We conduct experiments for the three sexism classification tasks making use of state-of-the-art machine learning models.
arXiv Detail & Related papers (2021-08-06T12:06:40Z) - Learning Domain-Specialised Representations for Cross-Lingual Biomedical
Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL)
We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task.
We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
arXiv Detail & Related papers (2020-05-02T04:34:37Z) - "Call me sexist, but...": Revisiting Sexism Detection Using
Psychological Scales and Adversarial Samples [2.029924828197095]
We outline the different dimensions of sexism by grounding them in their implementation in psychological scales.
From the scales, we derive a codebook for sexism in social media, which we use to annotate existing and novel datasets.
Results indicate that current machine learning models pick up on a very narrow set of linguistic markers of sexism and do not generalize well to out-of-domain examples.
arXiv Detail & Related papers (2020-04-27T13:07:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.