Crosslingual Topic Modeling with WikiPDA
- URL: http://arxiv.org/abs/2009.11207v2
- Date: Sun, 14 Feb 2021 13:28:18 GMT
- Title: Crosslingual Topic Modeling with WikiPDA
- Authors: Tiziano Piccardi, Robert West
- Abstract summary: We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA).
It learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics.
We show its utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification.
- Score: 15.198979978589476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a
crosslingual topic model that learns to represent Wikipedia articles written in
any language as distributions over a common set of language-independent topics.
It leverages the fact that Wikipedia articles link to each other and are mapped
to concepts in the Wikidata knowledge base, such that, when represented as bags
of links, articles are inherently language-independent. WikiPDA works in two
steps, by first densifying bags of links using matrix completion and then
training a standard monolingual topic model. A human evaluation shows that
WikiPDA produces more coherent topics than monolingual text-based LDA, thus
offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two
applications: a study of topical biases in 28 Wikipedia editions, and
crosslingual supervised classification. Finally, we highlight WikiPDA's
capacity for zero-shot language transfer, where a model is reused for new
languages without any fine-tuning. Researchers can benefit from WikiPDA as a
practical tool for studying Wikipedia's content across its 299 language
editions in interpretable ways, via an easy-to-use library publicly available
at https://github.com/epfl-dlab/WikiPDA.
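To make the two-step pipeline above concrete, here is a minimal sketch, assuming articles have already been reduced to bags of Wikidata QIDs; it uses gensim's standard LDA, stubs out the matrix-completion densification step, and all names and example QIDs are illustrative rather than the WikiPDA library's actual API:

# Illustrative sketch only: bags of Wikidata QIDs stand in for articles' outgoing
# links; the identifiers and example QIDs are hypothetical, not the WikiPDA API.
from gensim import corpora, models

# One language-independent bag of links (Wikidata QIDs) per article.
bags_of_links = [
    ["Q42", "Q36180", "Q5", "Q6581097"],
    ["Q7251", "Q21198", "Q5", "Q6581097"],
    ["Q11660", "Q2539", "Q21198", "Q12483"],
]

# Step 1 in the paper (omitted here): densify these sparse bags via matrix
# completion, i.e. add links an article plausibly should contain but does not yet.

# Step 2: train a standard monolingual topic model on the bags of links.
dictionary = corpora.Dictionary(bags_of_links)
corpus = [dictionary.doc2bow(bag) for bag in bags_of_links]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# Any article from any language edition, once mapped to QIDs, is projected
# onto the same set of language-independent topics.
new_article = dictionary.doc2bow(["Q21198", "Q2539", "Q5"])
print(lda.get_document_topics(new_article))

Because two articles on the same subject in different language editions link to largely the same Wikidata concepts, any article mapped to QIDs can be projected onto the shared topic space without retraining, which is what enables the zero-shot language transfer mentioned above.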
Related papers
- An Open Multilingual System for Scoring Readability of Wikipedia [3.992677070507323]
We develop a multilingual model to score the readability of Wikipedia articles.
We create a novel multilingual dataset spanning 14 languages, by matching articles from Wikipedia to simplified Wikipedia and online children's encyclopedias.
We show that our model performs well in a zero-shot scenario, yielding a ranking accuracy of more than 80% across 14 languages.
arXiv Detail & Related papers (2024-06-03T23:07:18Z) - XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages [11.581072296148031]
Existing work on Wikipedia text generation has focused only on English, where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
arXiv Detail & Related papers (2023-03-22T04:52:43Z) - Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z) - WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z) - Instilling Type Knowledge in Language Models via Multi-Task QA [13.244420493711981]
We introduce a method to instill fine-grained type knowledge in language models with text-to-text pre-training on type-centric questions.
We create the WikiWiki dataset: entities and passages from 10M Wikipedia articles linked to the Wikidata knowledge graph with 41K types.
Models trained on WikiWiki achieve state-of-the-art performance in zero-shot dialog state tracking benchmarks, accurately infer entity types in Wikipedia articles, and can discover new types deemed useful by human judges.
arXiv Detail & Related papers (2022-04-28T22:06:32Z) - Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z) - Assessing the quality of sources in Wikidata across languages: a hybrid approach [64.05097584373979]
We run a series of microtask experiments to evaluate a large corpus of references, sampled from Wikidata triples with labels in several languages.
We use a consolidated, curated version of the crowdsourced assessments to train several machine learning models to scale up the analysis to the whole of Wikidata.
The findings help us ascertain the quality of references in Wikidata, and identify common challenges in defining and capturing the quality of user-generated multilingual structured data on the web.
arXiv Detail & Related papers (2021-09-20T10:06:46Z) - Language-agnostic Topic Classification for Wikipedia [1.950869817974852]
We propose a language-agnostic approach based on the links in an article for classifying articles into a taxonomy of topics.
We show that it matches the performance of a language-dependent approach while being simpler and having much greater coverage (see the illustrative sketch after this list).
arXiv Detail & Related papers (2021-02-26T22:17:50Z) - Multiple Texts as a Limiting Factor in Online Learning: Quantifying (Dis-)similarities of Knowledge Networks across Languages [60.00219873112454]
We investigate the hypothesis that the extent to which one obtains information on a given topic through Wikipedia depends on the language in which it is consulted.
Since Wikipedia is a central part of the web-based information landscape, this indicates a language-related, linguistic bias.
The article builds a bridge between reading research, educational science, Wikipedia research and computational linguistics.
arXiv Detail & Related papers (2020-08-05T11:11:55Z) - Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia.
This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention.
We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engine query logs.
arXiv Detail & Related papers (2020-05-02T04:00:26Z)
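The "Language-agnostic Topic Classification for Wikipedia" entry above rests on the same bag-of-links idea as WikiPDA; below is a minimal hypothetical sketch of such a setup using scikit-learn, with invented articles, labels, and model choice, not the cited paper's actual model or data:

# Hypothetical illustration of link-based, language-agnostic topic classification;
# the data, labels, and classifier are invented and not taken from the cited paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each article is represented by its outgoing links as Wikidata QIDs, so the same
# features apply to every language edition; topic labels are purely illustrative.
articles = [
    "Q42 Q36180 Q5",
    "Q7251 Q21198 Q5",
    "Q11660 Q2539 Q21198",
]
labels = ["Culture", "STEM", "STEM"]

classifier = make_pipeline(
    CountVectorizer(token_pattern=r"Q\d+"),  # one feature per linked QID
    LogisticRegression(max_iter=1000),
)
classifier.fit(articles, labels)
print(classifier.predict(["Q2539 Q21198"]))  # classify an unseen bag of links

A logistic regression over raw link counts is only the simplest instantiation; the point it illustrates is that the feature space itself, like WikiPDA's topic space, never references any particular language.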