How Good is Your Wikipedia?
- URL: http://arxiv.org/abs/2411.05527v1
- Date: Fri, 08 Nov 2024 12:35:58 GMT
- Title: How Good is Your Wikipedia?
- Authors: Kushal Tatariya, Artur Kulmizev, Wessel Poelman, Esther Ploeger, Marcel Bollmann, Johannes Bjerva, Jiaming Luo, Heather Lent, Miryam de Lhoneux
- Abstract summary: This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques.
We find that data quality pruning is an effective means for resource-efficient training without hurting performance.
- Score: 13.814955569390207
- Abstract: Wikipedia's perceived high quality and broad language coverage have established it as a fundamental resource in multilingual NLP. In the context of low-resource languages, however, these quality assumptions are increasingly being scrutinised. This paper critically examines the data quality of Wikipedia in a non-English setting by subjecting it to various quality filtering techniques, revealing widespread issues such as a high percentage of one-line articles and duplicate articles. We evaluate the downstream impact of quality filtering on Wikipedia and find that data quality pruning is an effective means for resource-efficient training without hurting performance, especially for low-resource languages. Moreover, we advocate for a shift in perspective from seeking a general definition of data quality towards a more language- and task-specific one. Ultimately, we aim for this study to serve as a guide to using Wikipedia for pretraining in a multilingual setting.
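The pruning described in the abstract comes down to scoring articles with cheap heuristics and dropping the ones that fail. Below is a minimal sketch covering two of the reported issues, one-line articles and duplicate articles, assuming the articles are already available as plain-text strings; the heuristics are illustrative and not the paper's exact pipeline.

```python
import hashlib

def is_one_line(text: str) -> bool:
    """True if the article body contains at most one non-empty line."""
    return len([ln for ln in text.splitlines() if ln.strip()]) <= 1

def prune(articles: dict[str, str]) -> dict[str, str]:
    """Drop one-line articles and exact duplicates (by content hash).

    `articles` maps titles to plain-text bodies. Illustrative filters
    only; the paper's actual filtering techniques may differ.
    """
    seen, kept = set(), {}
    for title, text in articles.items():
        if is_one_line(text):
            continue  # too short to contribute useful pretraining signal
        digest = hashlib.md5(text.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # identical (after normalisation) to an article already kept
        seen.add(digest)
        kept[title] = text
    return kept
```

Length-percentile cut-offs, language identification, or near-duplicate detection fit the same score-then-drop pattern.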
Related papers
- Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages [0.19698344608599344]
We propose a novel computational framework for modeling the quality of Wikipedia articles.
Our framework is based on language-agnostic structural features extracted from the articles.
We have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia.
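A rough sketch of what language-agnostic structural features can look like when computed from raw wikitext; the concrete feature set is defined in that paper, so the features below are only plausible examples.

```python
import re

def structural_features(wikitext: str) -> dict[str, int]:
    """Counts of structural elements that exist in every language edition."""
    return {
        "n_chars": len(wikitext),
        "n_sections": len(re.findall(r"^==+[^=]+==+\s*$", wikitext, re.M)),
        "n_internal_links": wikitext.count("[["),
        "n_references": len(re.findall(r"<ref[ >/]", wikitext)),
        "n_templates": wikitext.count("{{"),
        "n_images": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
    }
```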
arXiv Detail & Related papers (2024-04-15T13:07:31Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
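Lexical diversity comparisons of this kind are usually backed by simple statistics such as the type-token ratio; a minimal sketch follows (whitespace tokenisation and the corpus names are assumptions, not the paper's setup).

```python
def type_token_ratio(text: str) -> float:
    """Unique tokens over total tokens; higher means more lexical diversity."""
    tokens = text.lower().split()  # naive whitespace tokenisation
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Compare corpora built by different collection methods (placeholder texts).
corpora = {"scraped": "...", "human_translated": "...", "native_written": "..."}
for name, text in corpora.items():
    print(f"{name}: {type_token_ratio(text):.3f}")
```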
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia [14.325320851640084]
We propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia.
Each sentence is extracted from the entire revision history of English Wikipedia.
WikiSQE has about 3.4 million sentences with 153 quality labels.
arXiv Detail & Related papers (2023-05-10T06:45:13Z)
- XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused only on English, where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, the task of cross-lingual multi-document summarization: generating Wikipedia-style text from multiple reference articles written in various languages.
arXiv Detail & Related papers (2023-03-22T04:52:43Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose a mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
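A minimal sketch of how such statements could be represented and grouped before alignment; the quadruple layout (triple plus qualifiers) and the grouping step are assumptions read off the summary, not the authors' exact pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class Statement:
    """A Wikidata-style statement: a core triple plus optional qualifiers."""
    subject: str                     # item ID, e.g. "Q42"
    predicate: str                   # property ID, e.g. "P69" ("educated at")
    obj: str                         # item ID or literal value
    qualifiers: dict = field(default_factory=dict)  # e.g. {"P580": "1971"}

def group_by_subject(statements: list[Statement]) -> dict[str, list[Statement]]:
    """Group statements per entity so each group can be mapped to sentences
    about that entity in English Wikipedia (the alignment itself is not shown)."""
    grouped: dict[str, list[Statement]] = {}
    for st in statements:
        grouped.setdefault(st.subject, []).append(st)
    return grouped
```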
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection [83.3580786484122]
We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes, are more likely to be classified as high quality.
We argue that privileging any corpus as high quality entails a language ideology.
arXiv Detail & Related papers (2022-01-25T17:20:04Z)
- Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
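A minimal sketch of the scale-up step, assuming each reference has already been reduced to a numeric feature vector with a consolidated crowd label; the toy features and the model choice below are placeholders, not the paper's configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-ins: one feature vector per crowd-assessed reference
# (e.g. source type, whether the URL resolves, label-language match)
# and the consolidated crowd judgement (1 = relevant/authoritative).
X = [[1, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=2).mean())  # sanity check on the sample
clf.fit(X, y)
# The fitted model can then score references across all of Wikidata,
# far beyond what crowdworkers could assess directly.
```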
arXiv Detail & Related papers (2021-09-20T10:06:46Z)
- Language-agnostic Topic Classification for Wikipedia [1.950869817974852]
We propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics.
We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.
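A minimal sketch of the idea: reduce each article to the (language-independent) Wikidata items its links resolve to, then train a standard classifier on that bag of items. The item IDs, topic labels, and model below are placeholders, not the paper's setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each article is represented only by the Wikidata IDs of its outgoing links,
# so the same model works for any language edition.
train_links  = ["Q11660 Q2539", "Q5891 Q483394", "Q11660 Q82799"]  # placeholders
train_topics = ["STEM.Technology", "Culture.Philosophy", "STEM.Technology"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_links, train_topics)
print(model.predict(["Q2539 Q82799"]))  # topic for an unseen article's links
```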
arXiv Detail & Related papers (2021-02-26T22:17:50Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
- Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia.
This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention.
We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engine query logs.
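One way to read "query logs" here is as a mapping from search queries to the English Wikipedia pages users ended up on; below is a heavily simplified sketch of candidate generation along those lines (the log format and function names are assumptions, not QuEL's implementation).

```python
from collections import Counter, defaultdict

def build_candidate_index(query_log):
    """Index of query -> (clicked English Wikipedia title -> click count).

    `query_log` is an iterable of (query, clicked_title) pairs; an
    illustrative stand-in for real search-engine logs.
    """
    index = defaultdict(Counter)
    for query, title in query_log:
        index[query.lower()][title] += 1
    return index

def candidate_titles(index, mention, k=5):
    """Top-k English Wikipedia title candidates for a (foreign-language) mention."""
    return [title for title, _ in index[mention.lower()].most_common(k)]
```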
arXiv Detail & Related papers (2020-05-02T04:00:26Z)