WikiMulti: a Corpus for Cross-Lingual Summarization
- URL: http://arxiv.org/abs/2204.11104v1
- Date: Sat, 23 Apr 2022 16:47:48 GMT
- Title: WikiMulti: a Corpus for Cross-Lingual Summarization
- Authors: Pavel Tikhonov, Valentin Malykh
- Abstract summary: Cross-lingual summarization is the task of producing a summary in one language for a source document in a different language.
We introduce WikiMulti - a new dataset for cross-lingual summarization based on Wikipedia articles in 15 languages.
- Score: 5.566656105144887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-lingual summarization (CLS) is the task of producing a summary
in one language for a source document in a different language. We introduce
WikiMulti - a new dataset for cross-lingual summarization based on Wikipedia
articles in 15 languages. As a set of baselines for further studies, we
evaluate the performance of existing cross-lingual abstractive summarization
methods on our dataset. We make our dataset publicly available here:
https://github.com/tikhonovpavel/wikimulti
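Baseline evaluation on a dataset like this typically reports ROUGE between a generated summary and the reference. As a minimal, self-contained sketch (not the authors' evaluation code; the toy strings below are invented), a unigram ROUGE-1 F1 can be computed like so:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Toy cross-lingual setting: an English summary generated for a non-English
# article, scored against the English reference summary.
reference = "wikimulti is a dataset for cross lingual summarization"
candidate = "wikimulti is a corpus for cross lingual summarization"
score = rouge1_f1(candidate, reference)
```

In practice a library implementation (with stemming and multiple ROUGE variants) would be used, but the formula above captures the core precision/recall trade-off.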
Related papers
- A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model [15.596156608713347]
In real-world scenarios, news about an international event often involves multiple documents in different languages.
We construct a mixed-language multi-document news summarization dataset (MLMD-news).
This dataset covers four languages and contains 10,992 pairs of source-document clusters and target summaries.
arXiv Detail & Related papers (2024-10-13T08:15:33Z)
- Automatic Data Retrieval for Cross Lingual Summarization [4.759360739268894]
Cross-lingual summarization involves summarizing text written in one language into a different one.
In this work, we aim to perform cross-lingual summarization from English to Hindi.
arXiv Detail & Related papers (2023-12-22T09:13:24Z)
- $\mu$PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge [72.64847925450368]
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language.
This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge.
arXiv Detail & Related papers (2023-05-23T16:25:21Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and article bodies from language-aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- The RELX Dataset and Matching the Multilingual Blanks for Cross-Lingual Relation Classification [0.0]
Current approaches for relation classification are mainly focused on the English language.
We propose two cross-lingual relation classification models: a baseline model based on Multilingual BERT and a new multilingual pretraining setup.
For evaluation, we introduce a new public benchmark dataset for cross-lingual relation classification in English, French, German, Spanish, and Turkish.
arXiv Detail & Related papers (2020-10-19T11:08:16Z)
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization [41.578594261746055]
We introduce WikiLingua, a large-scale multilingual dataset for the evaluation of cross-lingual abstractive summarization systems.
We extract article and summary pairs in 18 languages from WikiHow, a high-quality collaborative resource of how-to guides on a diverse set of topics written by human authors.
We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article.
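The image-based alignment idea can be sketched as follows (a simplified illustration, not the WikiLingua authors' actual pipeline; the step structures and image names below are invented for the example):

```python
def align_steps(steps_a, steps_b):
    """Align how-to steps across two language versions of an article by the
    image illustrating each step. Each step is an (image, text) pair; steps
    sharing the same image are treated as translations of each other.
    A simplified sketch: real articles may have missing or duplicate images."""
    by_image = {img: j for j, (img, _text) in enumerate(steps_b)}
    pairs = []
    for i, (img, _text) in enumerate(steps_a):
        j = by_image.get(img)
        if j is not None:
            pairs.append((i, j))
    return pairs

# English and Spanish versions of the same how-to article: (image, step text).
en = [("knead.jpg", "Knead the dough."), ("bake.jpg", "Bake for 30 minutes.")]
es = [("bake.jpg", "Hornea durante 30 minutos."), ("knead.jpg", "Amasa la masa.")]
print(align_steps(en, es))  # [(0, 1), (1, 0)]
```

The key design point is that images, unlike text, are language-independent, so they serve as free anchors for cross-lingual alignment even when steps are reordered.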
arXiv Detail & Related papers (2020-10-07T00:28:05Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
- Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
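The count of 66 cross-lingual datasets in Multi-SimLex follows directly from pairing the 12 languages: every unordered pair of distinct languages yields one cross-lingual dataset, i.e. C(12, 2) = 66. A one-line check:

```python
from math import comb

n_languages = 12
n_crosslingual = comb(n_languages, 2)  # unordered pairs of distinct languages
print(n_crosslingual)  # 66
```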
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.