SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes
- URL: http://arxiv.org/abs/2403.05696v1
- Date: Fri, 8 Mar 2024 22:09:58 GMT
- Authors: Mukul Bhutani, Kevin Robinson, Vinodkumar Prabhakaran, Shachi Dave, Sunipa Dev
- Abstract summary: SeeGULL Multilingual is a global-scale multilingual dataset of social stereotypes spanning 20 languages, with human annotations across 23 regions; the paper demonstrates its utility in identifying gaps in model evaluations.
- Score: 18.991295993710224
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While generative multilingual models are rapidly being deployed, their safety
and fairness evaluations are largely limited to resources collected in English.
This is especially problematic for evaluations targeting inherently
socio-cultural phenomena such as stereotyping, where it is important to build
multilingual resources that reflect the stereotypes prevalent in the respective
language communities. However, gathering these resources at scale, in varied
languages and regions, poses a significant challenge, as it requires broad
socio-cultural knowledge and can also be prohibitively expensive. To overcome
this critical gap, we employ a recently introduced approach that couples LLM
generations for scale with culturally situated validations for reliability, and
build SeeGULL Multilingual, a global-scale multilingual dataset of social
stereotypes, containing over 25K stereotypes, spanning 20 languages, with human
annotations across 23 regions, and demonstrate its utility in identifying gaps
in model evaluations. Content warning: Stereotypes shared in this paper can be
offensive.
Related papers
- Socially Responsible Data for Large Multilingual Language Models [12.338723881042926]
Large Language Models (LLMs) have rapidly increased in size and apparent capabilities in the last three years.
Various efforts are striving for models to accommodate languages of communities outside of the Global North.
arXiv Detail & Related papers (2024-09-08T23:51:04Z)
- Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- Multilingual large language models leak human stereotypes across language boundaries [25.903732543380528]
We investigate how stereotypical associations leak across four languages: English, Russian, Chinese, and Hindi.
Hindi appears to be the most susceptible to influence from other languages, while Chinese is the least.
arXiv Detail & Related papers (2023-12-12T10:24:17Z)
- Building Socio-culturally Inclusive Stereotype Resources with Community Engagement [9.131536842607069]
We demonstrate a socio-culturally aware expansion of evaluation resources in the Indian societal context, specifically for the harm of stereotyping.
The resultant resource increases the number of known stereotypes for and in the Indian context by over 1,000, spanning many unique identities.
arXiv Detail & Related papers (2023-07-20T01:26:34Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Fairness in Language Models Beyond English: Gaps and Challenges [11.62418844341466]
This paper presents a survey of fairness in multilingual and non-English contexts.
It highlights the shortcomings of current research and the difficulties faced by methods designed for English.
arXiv Detail & Related papers (2023-02-24T11:25:50Z)
- Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages [62.730361829175415]
MIRACL is a multilingual dataset we have built for the WSDM 2023 Cup challenge.
It focuses on ad hoc retrieval across 18 different languages.
Our goal is to spur research that will improve retrieval across a continuum of languages.
arXiv Detail & Related papers (2022-10-18T16:47:18Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.