XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation
in Low Resource Languages
- URL: http://arxiv.org/abs/2303.12308v2
- Date: Tue, 18 Apr 2023 09:38:59 GMT
- Title: XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation
in Low Resource Languages
- Authors: Dhaval Taunk, Shivprasad Sagare, Anupam Patil, Shivansh Subramanian,
Manish Gupta and Vasudeva Varma
- Abstract summary: Existing work on Wikipedia text generation has focused on English only where English reference articles are summarized to generate English Wikipedia pages.
We propose XWikiGen, which is the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text.
- Score: 11.581072296148031
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Lack of encyclopedic text contributors, especially on Wikipedia, makes
automated text generation for low resource (LR) languages a critical problem.
Existing work on Wikipedia text generation has focused on English only where
English reference articles are summarized to generate English Wikipedia pages.
But, for low-resource languages, the scarcity of reference articles makes
monolingual summarization ineffective in solving this problem. Hence, in this
work, we propose XWikiGen, which is the task of cross-lingual multi-document
summarization of text from multiple reference articles, written in various
languages, to generate Wikipedia-style text. Accordingly, we contribute a
benchmark dataset, XWikiRef, spanning ~69K Wikipedia articles covering five
domains and eight languages. We harness this dataset to train a two-stage
system where the input is a set of citations and a section title and the output
is a section-specific LR summary. The proposed system is based on a novel idea
of neural unsupervised extractive summarization to coarsely identify salient
information followed by a neural abstractive model to generate the
section-specific text. Extensive experiments show that multi-domain training is
better than the multi-lingual setup on average.
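The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: stage 1 here ranks reference sentences against the section title with IDF-weighted word overlap as a simple stand-in for neural unsupervised extractive scoring, and stage 2 (the neural abstractive model) is only indicated by a comment. The function names and toy data are invented for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude word tokenizer; a real system would use language-specific tokenization.
    return re.findall(r"\w+", text.lower())

def extract_salient(sentences, section_title, k=2):
    """Stage 1 stand-in: keep the k sentences most relevant to the section title.

    Relevance here is IDF-weighted overlap with the title tokens; the paper
    instead uses neural unsupervised extractive summarization for this step.
    """
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(tokenize(s)))
    # Smoothed IDF computed over the candidate sentences themselves.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    title_tokens = set(tokenize(section_title))

    def score(s):
        return sum(idf.get(t, 0.0) for t in set(tokenize(s)) & title_tokens)

    return sorted(sentences, key=score, reverse=True)[:k]

# Toy input: sentences drawn from cited reference articles (any language).
citations = [
    "The film won three national awards after its release.",
    "Ticket prices rose slightly that year.",
    "Critics praised the film and it received several awards.",
]
salient = extract_salient(citations, section_title="Awards", k=2)
# Stage 2 (not shown): pass `salient` to a multilingual abstractive model
# (e.g. an mBART/mT5-style seq2seq) to generate the target-language section.
```

The coarse-then-fine design matters because the full set of cross-lingual citations is usually far too long for an abstractive model's input window; the extractive step trims it to the salient core first.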
Related papers
- MegaWika: Millions of reports and their sources across 50 diverse
languages [74.3909725023673]
MegaWika consists of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials.
We process this dataset for a myriad of applications, including translating non-English articles for cross-lingual applications.
MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual.
arXiv Detail & Related papers (2023-07-13T20:04:02Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- WikiMulti: a Corpus for Cross-Lingual Summarization [5.566656105144887]
Cross-lingual summarization is the task of producing a summary in one language for a source document written in a different language.
We introduce WikiMulti - a new dataset for cross-lingual summarization based on Wikipedia articles in 15 languages.
arXiv Detail & Related papers (2022-04-23T16:47:48Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high-quality, collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
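The image-based alignment idea can be sketched as follows. This is an illustrative toy, not WikiLingua's actual pipeline: it pairs how-to steps across two language editions by matching the image that illustrates each step, on the assumption that the same step reuses the same image file in every language. The function name, tuple format, and data are invented for this example.

```python
def align_by_images(steps_src, steps_tgt):
    """Return (source_step, target_step) text pairs whose images match.

    Each input is a list of (image_id, step_text) tuples, where `image_id`
    is any stable identifier such as the step image's filename.
    """
    by_image = {img: text for img, text in steps_tgt}
    return [(text, by_image[img]) for img, text in steps_src if img in by_image]

# Toy data: the same how-to article in English and Spanish.
steps_en = [("whisk.jpg", "Whisk the eggs."), ("pan.jpg", "Heat the pan.")]
steps_es = [("pan.jpg", "Calienta la sarten."), ("whisk.jpg", "Bate los huevos.")]
pairs = align_by_images(steps_en, steps_es)
# pairs == [("Whisk the eggs.", "Bate los huevos."),
#           ("Heat the pan.", "Calienta la sarten.")]
```

Matching on shared images rather than on text sidesteps the need for any bilingual resources when building the gold alignments.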
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
- Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia.
This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention.
We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engine query logs.
arXiv Detail & Related papers (2020-05-02T04:00:26Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.