StereoKG: Data-Driven Knowledge Graph Construction for Cultural
Knowledge and Stereotypes
- URL: http://arxiv.org/abs/2205.14036v1
- Date: Fri, 27 May 2022 15:09:56 GMT
- Title: StereoKG: Data-Driven Knowledge Graph Construction for Cultural
Knowledge and Stereotypes
- Authors: Awantee Deshpande, Dana Ruiter, Marius Mosbach, Dietrich Klakow
- Abstract summary: We present a fully data-driven pipeline for generating a knowledge graph (KG) of cultural knowledge and stereotypes.
Our resulting KG covers 5 religious groups and 5 nationalities and can easily be extended to include more entities.
- Score: 17.916919837253108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Analyzing ethnic or religious bias is important for improving fairness,
accountability, and transparency of natural language processing models.
However, many techniques rely on human-compiled lists of bias terms, which are
expensive to create and are limited in coverage. In this study, we present a
fully data-driven pipeline for generating a knowledge graph (KG) of cultural
knowledge and stereotypes. Our resulting KG covers 5 religious groups and 5
nationalities and can easily be extended to include more entities. Our human
evaluation shows that the majority (59.2%) of non-singleton entries are
coherent and complete stereotypes. We further show that performing intermediate
masked language model training on the verbalized KG leads to a higher level of
cultural awareness in the model and has the potential to increase
classification performance on knowledge-crucial samples on a related task,
i.e., hate speech detection.
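The verbalization step described in the abstract, turning KG triples into natural-language sentences that a masked language model can be trained on, can be sketched roughly as follows. The triples and relation phrasing are illustrative assumptions, not data from the StereoKG release.

```python
# Rough sketch of verbalizing KG triples for intermediate MLM training.
# The triples and phrasing below are illustrative assumptions, not data
# from the StereoKG release.

def verbalize(triple):
    """Turn a (subject, relation, object) triple into a plain sentence."""
    subject, relation, obj = triple
    return f"{subject} {relation} {obj}."

# Hypothetical example triples in the (subject, relation, object) shape
# implied by the abstract.
triples = [
    ("Germans", "are known for", "punctuality"),
    ("Canadians", "are said to be", "polite"),
]

sentences = [verbalize(t) for t in triples]
# The resulting sentences would then serve as a corpus for intermediate
# masked language model training (e.g. with a standard MLM data collator)
# before fine-tuning on the downstream hate speech detection task.
```

This is only the data-preparation half of the recipe; the intermediate MLM training itself would use a standard masked-token objective over these sentences.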
Related papers
- Scaling for Fairness? Analyzing Model Size, Data Composition, and Multilinguality in Vision-Language Bias [14.632649933582648]
We investigate how dataset composition, model size, and multilingual training affect gender and racial bias in a popular VLM, CLIP, and its open source variants.
To assess social perception bias, we measure the zero-shot performance on face images featuring socially charged terms.
arXiv Detail & Related papers (2025-01-22T21:08:30Z)
- Risks of Cultural Erasure in Large Language Models [4.613949381428196]
We argue for the need for metricizable evaluations of language technologies that interrogate and account for historical power inequities.
We probe representations that a language model produces about different places around the world when asked to describe these contexts.
We analyze the cultures represented in the travel recommendations produced by a set of language model applications.
arXiv Detail & Related papers (2025-01-02T04:57:50Z) - Attributing Culture-Conditioned Generations to Pretraining Corpora [26.992883552982335]
We analyze how models associate entities with cultures based on pretraining data patterns.
We find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none.
arXiv Detail & Related papers (2024-12-30T07:09:25Z) - HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection [0.0]
We introduce HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations.
We establish the Expanded Multi-Grain Stereotype dataset (EMGSD), comprising 57,201 labelled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes.
We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and
arXiv Detail & Related papers (2024-09-17T22:06:46Z)
- Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
- Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
- CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models [52.25049362267279]
We present a Chinese Bias Benchmark dataset that consists of over 100K questions jointly constructed by human experts and generative language models.
The testing instances in the dataset are automatically derived from 3K+ high-quality templates manually authored with stringent quality control.
Extensive experiments demonstrate the effectiveness of the dataset in detecting model bias, with all 10 publicly available Chinese large language models exhibiting strong bias in certain categories.
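Deriving test instances automatically from manually authored templates, as described above, can be sketched like this. The template and filler values are invented for illustration and are not taken from the CBBQ dataset.

```python
# Hypothetical sketch of expanding authored templates into test
# instances, in the spirit of template-based bias benchmarks. The
# template and fillers are illustrative, not from the CBBQ release.

TEMPLATE = "{group_a} and {group_b} were waiting. Who is more likely to {action}?"

GROUP_PAIRS = [("an elderly person", "a young person")]
ACTIONS = ["forget the appointment"]

# Each (group pair, action) combination yields one concrete test instance.
instances = [
    TEMPLATE.format(group_a=a, group_b=b, action=act)
    for a, b in GROUP_PAIRS
    for act in ACTIONS
]
```

A small number of carefully controlled templates can thus fan out into a much larger set of test instances, which is how 3K+ templates can yield over 100K questions.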
arXiv Detail & Related papers (2023-06-28T14:14:44Z)
- An Analysis of Social Biases Present in BERT Variants Across Multiple Languages [0.0]
We investigate the bias present in monolingual BERT models across a diverse set of languages.
We propose a template-based method to measure any kind of bias, based on sentence pseudo-likelihood.
We conclude that current methods of probing for bias are highly language-dependent.
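The sentence pseudo-likelihood probe mentioned above can be sketched as follows: mask each token in turn, score it with a masked language model, and sum the log-probabilities. A toy uniform "model" stands in for a real MLM such as BERT here, so the scoring function is an illustrative assumption.

```python
import math

# Sketch of a sentence pseudo-log-likelihood probe. A toy scoring
# function stands in for a real masked language model; a real probe
# would query an MLM such as BERT at each masked position.

def toy_mlm_prob(tokens, position, token):
    """Toy stand-in for P(token | tokens with `position` masked)."""
    # Uniform over a tiny vocabulary; a real MLM returns contextual probs.
    vocab = {"the", "doctor", "nurse", "is", "skilled"}
    return 1.0 / len(vocab) if token in vocab else 1e-6

def pseudo_log_likelihood(tokens):
    """Sum of log P(token_i | tokens with position i masked)."""
    return sum(
        math.log(toy_mlm_prob(tokens, i, tok))
        for i, tok in enumerate(tokens)
    )

sentence = ["the", "doctor", "is", "skilled"]
score = pseudo_log_likelihood(sentence)  # 4 * log(1/5) with the toy model
```

Comparing such scores for sentences that differ only in the target group term is the usual way a pseudo-likelihood probe surfaces bias; the language-dependence noted above arises because these scores are not comparable across differently trained monolingual models.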
arXiv Detail & Related papers (2022-11-25T23:38:08Z)
- O-Dang! The Ontology of Dangerous Speech Messages [53.15616413153125]
We present O-Dang!: The Ontology of Dangerous Speech Messages, a systematic and interoperable Knowledge Graph (KG).
O-Dang! is designed to gather and organize Italian datasets into a structured KG, according to the principles shared within the Linguistic Linked Open Data community.
It provides a model for encoding both gold standard and single-annotator labels in the KG.
arXiv Detail & Related papers (2022-07-13T11:50:05Z)
- Improving Fairness in Large-Scale Object Recognition by CrowdSourced Demographic Information [7.968124582214686]
Representing objects fairly in machine learning datasets will lead to models that are less biased towards a particular culture.
We propose a simple and general approach, based on crowdsourcing the demographic composition of the contributors.
We present analysis which leads to a much fairer coverage of the world compared to existing datasets.
arXiv Detail & Related papers (2022-06-02T22:55:10Z)
- EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English [25.38572483508948]
We augment natural language processing models with cultural background features.
We show that there are noticeable differences in linguistic expressions among five English-speaking countries and across four states in the US.
Our findings support the importance of cultural background modeling to a wide variety of NLP tasks and demonstrate the applicability of EnCBP in culture-related research.
arXiv Detail & Related papers (2022-03-28T04:57:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.