Language-agnostic Topic Classification for Wikipedia
- URL: http://arxiv.org/abs/2103.00068v1
- Date: Fri, 26 Feb 2021 22:17:50 GMT
- Title: Language-agnostic Topic Classification for Wikipedia
- Authors: Isaac Johnson, Martin Gerlach and Diego Sáez-Trumper
- Abstract summary: We propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics.
We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage.
- Score: 1.950869817974852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A major challenge for many analyses of Wikipedia dynamics -- e.g., imbalances
in content quality, geographic differences in what content is popular, what
types of articles attract more editor discussion -- is grouping the very
diverse range of Wikipedia articles into coherent, consistent topics. This
problem has been addressed using various approaches based on Wikipedia's
category network, WikiProjects, and external taxonomies. However, these
approaches have always been limited in their coverage: typically, only a small
subset of articles can be classified, or the method cannot be applied across
(the more than 300) languages on Wikipedia. In this paper, we propose a
language-agnostic approach based on the links in an article for classifying
articles into a taxonomy of topics that can be easily applied to (almost) any
language and article on Wikipedia. We show that it matches the performance of a
language-dependent approach while being simpler and having much greater
coverage.
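To make the idea of classifying an article from its links concrete, here is a minimal sketch. It is not the authors' model: the abstract does not specify the classifier, so a simple link-overlap score stands in for it. The link IDs (`Q1`, `Q2`, ...), the topic labels, the 0.5 threshold, and the function names `train` and `predict` are all hypothetical. The only assumption carried over from the abstract is that an article is described by the pages it links to, using identifiers that do not depend on the article's language.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count, for each topic, how many training articles with that topic contain each link."""
    link_counts = defaultdict(Counter)
    for link_ids, topics in examples:
        for topic in topics:
            link_counts[topic].update(set(link_ids))
    return link_counts


def predict(link_counts, link_ids, threshold=0.5):
    """Return topics whose score (share of the article's links seen with that topic) meets the threshold."""
    links = set(link_ids)
    if not links:
        return {}  # an article without links cannot be scored
    scores = {}
    for topic, seen in link_counts.items():
        hits = sum(1 for link in links if seen[link] > 0)
        scores[topic] = hits / len(links)
    return {topic: s for topic, s in scores.items() if s >= threshold}


# Hypothetical training data: (outgoing link IDs, topic labels).
training = [
    (["Q1", "Q2", "Q3"], ["STEM"]),
    (["Q4", "Q5", "Q6"], ["Culture"]),
    (["Q3", "Q6", "Q7"], ["STEM", "Culture"]),
]
model = train(training)

# An article in any language, reduced to the IDs of the pages it links to.
print(predict(model, ["Q1", "Q2", "Q9"]))
```

Because the features are link identifiers rather than words, the same model could in principle score an article from any language edition, and an article may receive several topics or none.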
Related papers
- Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages [0.19698344608599344]
We propose a novel computational framework for modeling the quality of Wikipedia articles.
Our framework is based on language-agnostic structural features extracted from the articles.
We have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia.
arXiv Detail & Related papers (2024-04-15T13:07:31Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Orphan Articles: The Dark Matter of Wikipedia [13.290424502717734]
We conduct the first systematic study of orphan articles, which are articles without any incoming links from other Wikipedia articles.
We find that a surprisingly large extent of content, roughly 15% (8.8M) of all articles, is de facto invisible to readers navigating Wikipedia.
We also provide causal evidence through a quasi-experiment that adding new incoming links to orphans (de-orphanization) leads to a statistically significant increase of their visibility.
arXiv Detail & Related papers (2023-06-06T18:04:33Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- Generating Wikipedia Article Sections from Diverse Data Sources [57.23574577984244]
We benchmark several training and decoding strategies on WikiTableT.
Our qualitative analysis shows that the best approaches can generate fluent and high quality texts but they sometimes struggle with coherence.
arXiv Detail & Related papers (2020-12-29T19:35:34Z)
- Crosslingual Topic Modeling with WikiPDA [15.198979978589476]
We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA).
It learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics.
We show its utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification.
arXiv Detail & Related papers (2020-09-23T15:19:27Z)
- Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z)
- Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia.
This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention.
We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engine query logs.
arXiv Detail & Related papers (2020-05-02T04:00:26Z)
- What is Trending on Wikipedia? Capturing Trends and Language Biases Across Wikipedia Editions [4.916670182199368]
We propose an automatic evaluation and comparison of the browsing behavior of Wikipedia readers.
As an example, we focus on English, French, and Russian languages during the last four months of 2018.
The proposed method has three steps. Firstly, it extracts the most trending articles over a chosen period of time.
Secondly, it performs a semi-supervised topic extraction and thirdly, it compares topics across languages.
arXiv Detail & Related papers (2020-02-17T11:04:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.