Related papers: From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

URL: http://arxiv.org/abs/2404.17874v1
Date: Sat, 27 Apr 2024 12:10:10 GMT
Title: From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets
Authors: Manuel Tonneau, Diyi Liu, Samuel Fraiberger, Ralph Schroeder, Scott A. Hale, Paul Röttger,
Abstract summary: Hate speech datasets have traditionally been developed by language. We evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We find that HS datasets for English, Arabic and Spanish exhibit a strong geo-cultural bias.
Score: 10.264294331399434
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Perceptions of hate can vary greatly across cultural contexts. Hate speech (HS) datasets, however, have traditionally been developed by language. This hides potential cultural biases, as one language may be spoken in different countries home to different cultures. In this work, we evaluate cultural bias in HS datasets by leveraging two interrelated cultural proxies: language and geography. We conduct a systematic survey of HS datasets in eight languages and confirm past findings on their English-language bias, but also show that this bias has been steadily decreasing in the past few years. For three geographically-widespread languages -- English, Arabic and Spanish -- we then leverage geographical metadata from tweets to approximate geo-cultural contexts by pairing language and country information. We find that HS datasets for these languages exhibit a strong geo-cultural bias, largely overrepresenting a handful of countries (e.g., US and UK for English) relative to their prominence in both the broader social media population and the general population speaking these languages. Based on these findings, we formulate recommendations for the creation of future HS datasets.

Related papers

Measuring South Asian Biases in Large Language Models [1.5903891569492878]
This work addresses gaps by conducting a multilingual and intersectional analysis of Large Language Models (LLMs)<n>We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children.<n>We evaluate two self-debiasing strategies to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages.
arXiv Detail & Related papers (2025-05-24T02:18:17Z)
CARE: Aligning Language Models for Regional Cultural Awareness [28.676469530858924]
Existing language models (LMs) often exhibit a Western-centric bias and struggle to represent diverse cultural knowledge. Previous attempts to address this rely on synthetic data and express cultural knowledge only in English. We first introduce CARE, a multilingual resource of 24.1k responses with human preferences on 2,580 questions about Chinese and Arab cultures.
arXiv Detail & Related papers (2025-04-07T14:57:06Z)
Detecting Linguistic Diversity on Social Media [1.3108652488669732]
We use published census data as the ground truth and the social media sub-corpus from the Corpus of Global Language Use as our alternative data source. We identify the language conditions of each tweet in the social media data set and validated our results with two language identification models. The results suggest that social media language data has the possibility to provide a rich source of spatial and temporal insights on the linguistic profile of a place.
arXiv Detail & Related papers (2025-02-28T16:56:34Z)
Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs [2.5212698425008377]
Large Language Models (LLMs) are becoming increasingly capable across global languages. However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations. We compare two families of models: Google's Gemma models and OpenAI's turbo-series. We find no consistent relationships between language capabilities and cultural alignment.
arXiv Detail & Related papers (2025-02-23T11:02:41Z)
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English. We introduce World Cuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
BanStereoSet: A Dataset to Measure Stereotypical Social Biases in LLMs for Bangla [0.0]
This study presents BanStereoSet, a dataset designed to evaluate stereotypical social biases in multilingual LLMs for the Bangla language. Our dataset consists of 1,194 sentences spanning 9 categories of bias: race, profession, gender, ageism, beauty in profession, region, caste, and religion.
arXiv Detail & Related papers (2024-09-18T02:02:30Z)
See It from My Perspective: How Language Affects Cultural Bias in Image Understanding [60.70852566256668]
Vision-language models (VLMs) can respond to queries about images in many languages. We characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures. CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions. We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
Global Voices, Local Biases: Socio-Cultural Prejudices across Languages [22.92083941222383]
Human biases are ubiquitous but not uniform; disparities exist across linguistic, cultural, and societal borders. In this work, we scale the Word Embedding Association Test (WEAT) to 24 languages, enabling broader studies. To encompass more widely prevalent societal biases, we examine new bias dimensions across toxicity, ableism, and more.
arXiv Detail & Related papers (2023-10-26T17:07:50Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis [44.17106903728264]
Most hate speech datasets neglect the cultural diversity within a single language. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. Only 56.2% of the posts in CREHate achieve consensus among all countries, with the highest pairwise label difference rate of 26%.
arXiv Detail & Related papers (2023-08-31T13:14:47Z)
Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP. We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba. Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
Comparing Biases and the Impact of Multilingual Training across Multiple Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task. We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender. Our results reveal similarities in bias expression such as favoritism of groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.