Wiki-based Communities of Interest: Demographics and Outliers
- URL: http://arxiv.org/abs/2303.09189v2
- Date: Fri, 17 Mar 2023 08:34:30 GMT
- Title: Wiki-based Communities of Interest: Demographics and Outliers
- Authors: Hiba Arnaout, Simon Razniewski, Jeff Z. Pan
- Abstract summary: Identified from Wiki-based sources, the data covers 7.5k communities, such as members of the White House Coronavirus Task Force.
We release subject-centric and group-centric datasets in JSON format, as well as a browsing interface.
- Score: 18.953455338226103
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we release data about demographic information and outliers of
communities of interest. Identified from Wiki-based sources, mainly Wikidata,
the data covers 7.5k communities, such as members of the White House
Coronavirus Task Force, and 345k subjects, e.g., Deborah Birx. We describe the
statistical inference methodology adopted to mine such data. We release
subject-centric and group-centric datasets in JSON format, as well as a
browsing interface. Finally, we foresee three areas this research can have an
impact on: in social sciences research, it provides a resource for demographic
analyses; in web-scale collaborative encyclopedias, it serves as an edit
recommender to fill knowledge gaps; and in web search, it offers lists of
salient statements about queried subjects for higher user engagement.
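The paper releases group-centric JSON records and mines demographic outliers with statistical inference. As a rough illustration only (the field names and the frequency-threshold rule below are hypothetical, not the paper's actual schema or methodology), a group record and a naive outlier check might look like:

```python
import json
from collections import Counter

# Hypothetical group-centric record; field names are illustrative,
# not the released dataset's actual schema.
record = json.loads("""
{
  "group": "White House Coronavirus Task Force",
  "members": [
    {"name": "Deborah Birx", "gender": "female"},
    {"name": "Anthony Fauci", "gender": "male"},
    {"name": "Robert Redfield", "gender": "male"},
    {"name": "Jerome Adams", "gender": "male"},
    {"name": "Stephen Hahn", "gender": "male"}
  ]
}
""")

def outliers(members, attribute, threshold=0.25):
    """Flag members whose attribute value is rare within the group."""
    counts = Counter(m[attribute] for m in members)
    total = sum(counts.values())
    return [m["name"] for m in members
            if counts[m[attribute]] / total < threshold]

print(outliers(record["members"], "gender"))  # ['Deborah Birx']
```

The paper's inference methodology is more principled than a fixed frequency cutoff; this sketch only conveys the shape of the task: compare a subject's attribute value against its distribution within the community.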
Related papers
- Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia [49.80565462746646]
We introduce the InfoGap method -- an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level.
We evaluate InfoGap by analyzing LGBT people's portrayals, across 2.7K biography pages on English, Russian, and French Wikipedias.
arXiv Detail & Related papers (2024-10-05T20:40:49Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Knowledge Graph Representation for Political Information Sources [16.959319157216466]
We analyze data collected from two news portals, Breitbart News (BN) and New York Times (NYT).
Our research findings are presented through knowledge graphs, utilizing a dataset spanning 11.5 years gathered from BN and NYT media portals.
arXiv Detail & Related papers (2024-04-04T13:36:01Z) - Subdivisions and Crossroads: Identifying Hidden Community Structures in a Data Archive's Citation Network [1.6631602844999724]
This paper analyzes the community structure of an authoritative network of datasets cited in academic publications.
We identify communities of social science datasets and fields of research connected through shared data use.
Our research reveals the hidden structure of data reuse and demonstrates how interdisciplinary research communities organize around datasets as shared scientific inputs.
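The core idea, that papers citing the same datasets form detectable communities, can be conveyed with a deliberately crude stand-in: link papers that share a cited dataset and take connected components (the paper itself uses proper community-detection methods; the data below is invented for illustration):

```python
from collections import defaultdict

# Hypothetical citation data: dataset -> papers citing it (illustrative only).
citations = {
    "GSS": ["p1", "p2"],
    "ANES": ["p2", "p3"],
    "ImageNet": ["p4", "p5"],
}

# Build a paper-paper graph: two papers are linked if they cite a shared dataset.
graph = defaultdict(set)
for papers in citations.values():
    for a in papers:
        for b in papers:
            if a != b:
                graph[a].add(b)

def components(graph):
    """Connected components as a crude stand-in for community detection."""
    seen, comms = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comm = [node], set()
        while stack:
            n = stack.pop()
            if n in comm:
                continue
            comm.add(n)
            stack.extend(graph[n] - comm)
        seen |= comm
        comms.append(comm)
    return comms

print(components(graph))
```

Here p1-p3 cluster through shared social-science datasets while p4-p5 form a separate group, mirroring how shared data use partitions research fields.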
arXiv Detail & Related papers (2022-05-17T14:18:49Z) - Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z) - Common Misconceptions about Population Data [5.606904856295946]
This article discusses a diverse range of misconceptions about population data that we believe anybody who works with such data needs to be aware of.
The massive size of such databases is often mistaken as a guarantee for valid inferences on the population of interest.
We conclude with a set of recommendations for inference when using population data.
arXiv Detail & Related papers (2021-12-20T23:54:49Z) - Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtasks experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z) - Analyzing Race and Country of Citizenship Bias in Wikidata [2.6081347116384728]
We examine race and citizenship bias in general and with regard to STEM representation for scientists, software developers, and engineers.
We discovered that there is an overrepresentation of white individuals and those with citizenship in Europe and North America.
We have found additional data about STEM scientists from minority groups and linked it to Wikidata.
arXiv Detail & Related papers (2021-08-11T19:04:15Z) - Retiring Adult: New Datasets for Fair Machine Learning [47.27417042497261]
UCI Adult has served as the basis for the development and comparison of many algorithmic fairness interventions.
We reconstruct a superset of the UCI Adult data from available US Census sources and reveal idiosyncrasies of the UCI Adult dataset that limit its external validity.
Our primary contribution is a suite of new datasets that extend the existing data ecosystem for research on fair machine learning.
arXiv Detail & Related papers (2021-08-10T19:19:41Z) - REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset.
It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z) - Ontologies in CLARIAH: Towards Interoperability in History, Language and Media [0.05277024349608833]
One of the most important goals of digital humanities is to provide researchers with data and tools for new research questions.
The FAIR principles provide a framework, as they state that data needs to be: Findable, as data are often scattered among various sources; Accessible, since some might be offline or behind paywalls; and Interoperable, thus using standard and shared knowledge representation formats.
We describe the tools developed and integrated in the Dutch national project CLARIAH to address these issues.
arXiv Detail & Related papers (2020-04-06T17:38:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.