Assessing the quality of sources in Wikidata across languages: a hybrid
approach
- URL: http://arxiv.org/abs/2109.09405v1
- Date: Mon, 20 Sep 2021 10:06:46 GMT
- Title: Assessing the quality of sources in Wikidata across languages: a hybrid
approach
- Authors: Gabriel Amaral, Alessandro Piscopo, Lucie-Aimée Kaffee, Odinaldo
Rodrigues and Elena Simperl
- Abstract summary: We run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
- Score: 64.05097584373979
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wikidata is one of the most important sources of structured data on the web,
built by a worldwide community of volunteers. As a secondary source, its
contents must be backed by credible references; this is particularly important
as Wikidata explicitly encourages editors to add claims for which there is no
broad consensus, as long as they are corroborated by references. Nevertheless,
despite this essential link between content and references, Wikidata's ability
to systematically assess and assure the quality of its references remains
limited. To this end, we carry out a mixed-methods study to determine the
relevance, ease of access, and authoritativeness of Wikidata references, at
scale and in different languages, using online crowdsourcing, descriptive
statistics, and machine learning. Building on previous work of ours, we run a
series of microtask experiments to evaluate a large corpus of references,
sampled from Wikidata triples with labels in several languages. We use a
consolidated, curated version of the crowdsourced assessments to train several
machine learning models to scale up the analysis to the whole of Wikidata. The
findings help us ascertain the quality of references in Wikidata, and identify
common challenges in defining and capturing the quality of user-generated
multilingual structured data on the web. We also discuss ongoing editorial
practices, which could encourage the use of higher-quality references in a more
immediate way. All data and code used in the study are available on GitHub for
feedback and further improvement and deployment by the research community.
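The scale-up step can be pictured as an ordinary supervised-learning pipeline: consolidated crowd judgements become training labels, simple features are computed for each reference, and a classifier scores the remaining references. The following is a minimal sketch of that idea, assuming hypothetical file names, feature columns, a binary authoritativeness label, and a random-forest model; it is illustrative only and does not reproduce the authors' code on GitHub.

```python
# Minimal sketch of the scale-up step: train on crowd-labelled reference
# assessments, then score the full set of Wikidata references.
# File names, feature columns, label, and model choice are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical consolidated crowd assessments, one row per reference.
crowd = pd.read_csv("consolidated_crowd_assessments.csv")

# Assumed features describing a reference (protocol, domain category,
# label/source language match, whether the URL resolves).
features = ["is_https", "domain_type_id", "language_match", "is_reachable"]
X, y = crowd[features], crowd["authoritative"]  # assumed aggregated label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))

# Scale up: predict quality for all (unlabelled) references.
all_refs = pd.read_csv("wikidata_reference_features.csv")
all_refs["predicted_authoritative"] = model.predict(all_refs[features])
```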
Related papers
- Scholarly Wikidata: Population and Exploration of Conference Data in Wikidata using LLMs [4.721309965816974]
We propose to make scholarly data more sustainably accessible by leveraging Wikidata's infrastructure.
Our study focuses on data from 105 Semantic Web-related conferences and extends or adds more than 6,000 entities in Wikidata.
arXiv Detail & Related papers (2024-11-13T15:34:52Z)
- Wikidata as a seed for Web Extraction [4.273966905160028]
We present a framework that is able to identify and extract new facts that are published under multiple Web domains.
We take inspiration from ideas that are used to extract facts from textual collections and adapt them to extract facts from Web pages.
Our experiments show that we can achieve a mean F1-score of 84.07.
arXiv Detail & Related papers (2024-01-15T16:35:52Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
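To make the quadruple/triple representation concrete, here is a toy sketch that renders one Wikidata-style statement as an English sentence using a hand-written template; the property-to-phrase table and example values are invented for illustration and are not the mapping process evaluated in the paper.

```python
# Toy rendering of Wikidata-style statements as English sentences.
# A triple is (subject, property, object); a quadruple adds a qualifier.
# The templates and examples are invented; the paper maps statements to
# actual English Wikipedia sentences rather than templated text.

triple = ("Douglas Adams", "P69", "St John's College")
quadruple = ("Douglas Adams", "P69", "St John's College", ("P582", "1974"))

TEMPLATES = {"P69": "{subj} was educated at {obj}"}   # P69: educated at
QUALIFIER_TEMPLATES = {"P582": "until {value}"}       # P582: end time

def verbalize(statement):
    subj, prop, obj, *rest = statement
    sentence = TEMPLATES[prop].format(subj=subj, obj=obj)
    if rest:  # optional qualifier on the statement
        q_prop, q_value = rest[0]
        sentence += " " + QUALIFIER_TEMPLATES[q_prop].format(value=q_value)
    return sentence + "."

print(verbalize(triple))      # Douglas Adams was educated at St John's College.
print(verbalize(quadruple))   # ... was educated at St John's College until 1974.
```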
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- Enriching Wikidata with Linked Open Data [4.311189028205597]
Current linked open data (LOD) tools are not suited to enriching large graphs like Wikidata.
We present a novel workflow that includes gap detection, source selection, schema alignment, and semantic validation.
Our experiments show that our workflow can enrich Wikidata with millions of novel, high-quality statements from external LOD sources.
arXiv Detail & Related papers (2022-07-01T01:50:24Z)
- Improving Candidate Retrieval with Entity Profile Generation for Wikidata Entity Linking [76.00737707718795]
We propose a novel candidate retrieval paradigm based on entity profiling.
We use the profile to query the indexed search engine to retrieve candidate entities.
Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary.
arXiv Detail & Related papers (2022-02-27T17:38:53Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles.
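The corpus construction can be pictured as pairing the body of an article in a source language with the lead paragraph of the language-linked article in a target language. The sketch below shows only that pairing logic, with invented placeholder data rather than the authors' actual extraction pipeline.

```python
# Illustrative pairing of cross-lingual (document, summary) instances:
# source-language article body paired with the target-language lead paragraph
# of the language-linked article. Placeholder data, not the released corpus.

articles = {
    "Berlin": {
        "en": {"lead": "Berlin is the capital of Germany.", "body": "..."},
        "de": {"lead": "Berlin ist die Hauptstadt Deutschlands.", "body": "..."},
    },
}

def make_instances(articles, src_lang, tgt_lang):
    """Yield cross-lingual instances: source-language body, target-language lead."""
    for title, versions in articles.items():
        if src_lang in versions and tgt_lang in versions:
            yield {
                "title": title,
                "document": versions[src_lang]["body"],  # long source document
                "summary": versions[tgt_lang]["lead"],   # multi-sentence summary
            }

for inst in make_instances(articles, src_lang="de", tgt_lang="en"):
    print(inst["title"], "->", inst["summary"])
```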
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Wikidated 1.0: An Evolving Knowledge Graph Dataset of Wikidata's Revision History [5.727994421498849]
We present Wikidated 1.0, a dataset of Wikidata's full revision history.
To the best of our knowledge, it constitutes the first large dataset of an evolving knowledge graph.
arXiv Detail & Related papers (2021-12-09T15:54:03Z)
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means of encoding structured knowledge for knowledge-intensive applications, i.e., open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
- Multilingual Compositional Wikidata Questions [9.602430657819564]
We propose a method for creating a multilingual, parallel dataset of question-query pairs grounded in Wikidata.
We use this data to train semantic parsers for Hebrew, Kannada, Chinese and English to better understand the current strengths and weaknesses of multilingual semantic parsing.
arXiv Detail & Related papers (2021-08-07T19:40:38Z)