Analyzing Race and Country of Citizenship Bias in Wikidata
- URL: http://arxiv.org/abs/2108.05412v1
- Date: Wed, 11 Aug 2021 19:04:15 GMT
- Title: Analyzing Race and Country of Citizenship Bias in Wikidata
- Authors: Zaina Shaik, Filip Ilievski, Fred Morstatter
- Abstract summary: We examine race and citizenship bias both in general and with regard to STEM representation for scientists, software developers, and engineers.
We discovered an overrepresentation of white individuals and of those with citizenship in Europe and North America.
We found additional data about STEM scientists from minority groups and linked it to Wikidata.
- Score: 2.6081347116384728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As an open and collaborative knowledge graph created by users and bots, it is
possible that the knowledge in Wikidata is biased with regard to multiple
factors such as gender, race, and country of citizenship. Previous work has
mostly studied the representativeness of Wikidata knowledge in terms of genders
of people. In this paper, we examine race and citizenship bias in general
and with regard to STEM representation for scientists, software developers, and
engineers. By comparing Wikidata queries to real-world datasets, we identify
the differences in representation to characterize the biases present in
Wikidata. Through this analysis, we discovered that there is an
overrepresentation of white individuals and those with citizenship in Europe
and North America, while the remaining groups are generally underrepresented. Based
on these findings, we found additional data about STEM scientists from minority
groups and linked it to Wikidata. This data is ready to be inserted into
Wikidata with a bot. Increasing representation of minority race and country of
citizenship groups can create a more accurate portrayal of individuals in STEM.
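The paper's core methodology, comparing group counts from Wikidata queries against real-world baseline datasets, can be sketched as a representation-ratio computation. The function name, regions, and all counts below are illustrative placeholders, not figures from the paper:

```python
# Illustrative sketch of the comparison methodology: contrast each group's
# share among Wikidata entries with its share in a real-world baseline.
# All counts are made-up placeholder values, not the paper's data.

def representation_ratios(wikidata_counts, baseline_counts):
    """Return (share in Wikidata) / (share in baseline) per group.

    A ratio > 1 indicates overrepresentation in Wikidata;
    a ratio < 1 indicates underrepresentation.
    """
    wd_total = sum(wikidata_counts.values())
    base_total = sum(baseline_counts.values())
    return {
        group: (wikidata_counts.get(group, 0) / wd_total)
               / (baseline_counts[group] / base_total)
        for group in baseline_counts
    }

# Placeholder counts of scientists by citizenship region.
wikidata = {"Europe": 700, "North America": 200, "Africa": 20, "Asia": 80}
baseline = {"Europe": 300, "North America": 150, "Africa": 200, "Asia": 350}

ratios = representation_ratios(wikidata, baseline)
```

With these placeholder numbers, Europe's ratio is well above 1 (overrepresented) while Africa's is well below 1 (underrepresented), mirroring the direction of the paper's findings.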
Related papers
- Towards a Brazilian History Knowledge Graph [50.26735825937335]
We construct a knowledge graph for Brazilian history based on the Brazilian Dictionary of Historical Biographies (DHBB) and Wikipedia/Wikidata.
We show that many terms/entities described in the DHBB do not have corresponding concepts (or Q items) in Wikidata.
arXiv Detail & Related papers (2024-03-28T22:05:32Z)
- Wiki-based Communities of Interest: Demographics and Outliers [18.953455338226103]
Identified from Wiki-based sources, the data covers 7.5k communities, such as members of the White House Coronavirus Task Force.
We release subject-centric and group-centric datasets, as well as a browsing interface.
arXiv Detail & Related papers (2023-03-16T09:58:11Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- Does Wikidata Support Analogical Reasoning? [17.68704739786042]
We investigate whether the knowledge in Wikidata supports analogical reasoning.
We show that Wikidata can be used to create data for analogy classification.
We devise a set of metrics to guide an automatic method for extracting analogies from Wikidata.
arXiv Detail & Related papers (2022-10-02T20:46:52Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Enriching Wikidata with Linked Open Data [4.311189028205597]
Current linked open data (LOD) tools are not suitable to enrich large graphs like Wikidata.
We present a novel workflow that includes gap detection, source selection, schema alignment, and semantic validation.
Our experiments show that our workflow can enrich Wikidata with millions of novel, high-quality statements from external LOD sources.
arXiv Detail & Related papers (2022-07-01T01:50:24Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Survey on English Entity Linking on Wikidata [3.8289963781051415]
Wikidata is a frequently updated, community-driven, and multilingual knowledge graph.
Current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes for other knowledge graphs like DBpedia.
Almost all approaches employ specific properties like labels and sometimes descriptions but ignore characteristics such as the hyper-relational structure.
arXiv Detail & Related papers (2021-12-03T16:02:42Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- One Label, One Billion Faces: Usage and Consistency of Racial Categories in Computer Vision [75.82110684355979]
We study the racial system encoded by computer vision datasets supplying categorical race labels for face images.
We find that each dataset encodes a substantially unique racial system, despite nominally equivalent racial categories.
We find evidence that racial categories encode stereotypes, and exclude ethnic groups from categories on the basis of nonconformity to stereotypes.
arXiv Detail & Related papers (2021-02-03T22:50:04Z)
- Commonsense Knowledge in Wikidata [3.8359194344969807]
This paper investigates whether Wikidata contains commonsense knowledge which is complementary to existing commonsense sources.
We map the relations of Wikidata to ConceptNet, which we also leverage to integrate Wikidata-CS into an existing consolidated commonsense graph.
arXiv Detail & Related papers (2020-08-18T18:23:06Z)
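The relation mapping described in the Wikidata-CS entry above can be sketched as a property-to-relation lookup. The specific table below is an illustrative assumption for a handful of properties, not the paper's published mapping:

```python
# Hypothetical sketch of mapping Wikidata properties to ConceptNet relations,
# in the spirit of the Wikidata-CS work. This particular table is an
# illustrative assumption, not the paper's actual mapping.

WD_TO_CONCEPTNET = {
    "P279": "/r/IsA",     # subclass of
    "P31":  "/r/IsA",     # instance of
    "P361": "/r/PartOf",  # part of
    "P527": "/r/HasA",    # has part
}

def map_statement(subject, prop, obj):
    """Translate one Wikidata triple into a ConceptNet-style edge, if mapped."""
    rel = WD_TO_CONCEPTNET.get(prop)
    if rel is None:
        return None  # property has no commonsense counterpart in this table
    return (subject, rel, obj)

# Example triple: Q144 (dog) --P279 (subclass of)--> Q39201 (pet).
edge = map_statement("Q144", "P279", "Q39201")
```

Unmapped properties simply return `None`, reflecting that only a subset of Wikidata's relations carry commonsense content.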
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.