XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
- URL: http://arxiv.org/abs/2106.13822v1
- Date: Fri, 25 Jun 2021 18:00:24 GMT
- Title: XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
- Authors: Tahmid Hasan, Abhik Bhattacharjee, Md Saiful Islam, Kazi Samin,
Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, Rifat Shahriyar
- Abstract summary: We present XL-Sum, a comprehensive and diverse dataset of 1 million professionally annotated article-summary pairs from BBC.
The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available.
XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
- Score: 7.8288425529553916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary works on abstractive text summarization have focused primarily
on high-resource languages like English, mostly due to the limited availability
of datasets for low/mid-resource ones. In this work, we present XL-Sum, a
comprehensive and diverse dataset comprising 1 million professionally annotated
article-summary pairs from BBC, extracted using a set of carefully designed
heuristics. The dataset covers 44 languages ranging from low to high-resource,
for many of which no public dataset is currently available. XL-Sum is highly
abstractive, concise, and of high quality, as indicated by human and intrinsic
evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model,
with XL-Sum and experiment on multilingual and low-resource summarization
tasks. XL-Sum yields results competitive with those obtained on similar
monolingual datasets: multilingual training achieves ROUGE-2 scores above 11 on
the 10 languages we benchmark, with some exceeding 15. Additionally, training
on low-resource languages
individually also provides competitive performance. To the best of our
knowledge, XL-Sum is the largest abstractive summarization dataset in terms of
the number of samples collected from a single source and the number of
languages covered. We are releasing our dataset and models to encourage future
research on multilingual abstractive summarization. The resources can be found
at https://github.com/csebuetnlp/xl-sum.
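
As a minimal sketch of the fine-tuning setup mentioned in the abstract, the snippet below fine-tunes a small mT5 checkpoint on one XL-Sum language with the Hugging Face libraries. The dataset id (csebuetnlp/xlsum), the per-language config name, the "text"/"summary" column names, and the hyperparameters are assumptions based on the public release, not the paper's exact recipe.

# A minimal sketch, assuming the released dataset is available on the Hugging
# Face Hub as "csebuetnlp/xlsum" with per-language configs and "text"/"summary"
# columns; model choice and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "google/mt5-small"   # smaller than the mT5 variant used in the paper
LANGUAGE = "bengali"              # any of the 44 language configs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
dataset = load_dataset("csebuetnlp/xlsum", LANGUAGE)  # assumed Hub id

def preprocess(batch):
    # Articles become encoder inputs; summaries become decoder targets.
    inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    targets = tokenizer(text_target=batch["summary"], max_length=84, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="xlsum-mt5",
        learning_rate=5e-4,
        per_device_train_batch_size=4,
        num_train_epochs=1,
        predict_with_generate=True,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()

Reproducing the ROUGE-2 numbers quoted in the abstract would additionally require a larger mT5 checkpoint and the multilingual training schedule described in the paper.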
Related papers
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages [21.018996007110324]
This dataset includes 41.8 million news articles in 14 different Indic languages (and English).
To the best of our knowledge, this is the largest collection of curated articles for Indic languages currently available.
arXiv Detail & Related papers (2023-05-10T03:07:17Z)
- DN at SemEval-2023 Task 12: Low-Resource Language Text Classification via Multilingual Pretrained Language Model Fine-tuning [0.0]
Most existing models and datasets for sentiment analysis are developed for high-resource languages, such as English and Chinese.
The AfriSenti-SemEval 2023 Shared Task 12 aims to fill this gap by evaluating sentiment analysis models on low-resource African languages.
We present our solution to the shared task, in which we employed several multilingual XLM-R models with a classification head, trained on various data.
arXiv Detail & Related papers (2023-05-04T07:28:45Z)
- UniMax: Fairer and more Effective Language Sampling for Large-Scale Multilingual Pretraining [92.3702056505905]
We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases (an illustrative sketch of such capped sampling appears after this list).
arXiv Detail & Related papers (2023-04-18T17:45:50Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs [27.574815708395203]
CrossSum is a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs.
We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset.
We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language.
arXiv Detail & Related papers (2021-12-16T11:40:36Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language modeling.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- The Tatoeba Translation Challenge -- Realistic Data Sets for Low Resource and Multilingual MT [0.0]
This paper describes the development of a new benchmark for machine translation that provides training and test data for thousands of language pairs.
The main goal is to trigger the development of open translation tools and models with a much broader coverage of the world's languages.
arXiv Detail & Related papers (2020-10-13T13:12:21Z)
- MLSUM: The Multilingual Summarization Corpus [29.943949944682196]
MLSUM is the first large-scale MultiLingual SUMmarization dataset.
It contains 1.5M+ article/summary pairs in five different languages.
arXiv Detail & Related papers (2020-04-30T15:58:34Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
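
The UniMax entry above describes its sampling method only at a high level. As a rough illustration of the general idea (spread a fixed budget uniformly across languages while capping how often any single corpus is repeated), one could sketch the allocation as below; the function name, cap value, and allocation order are illustrative assumptions, not the published algorithm.

# Illustrative sketch only -- not the published UniMax procedure. Spread a total
# token budget as evenly as possible across languages, but never allocate more
# than `max_epochs` passes over any single language's corpus, so tail languages
# are not overfit through repetition.
def capped_uniform_allocation(corpus_sizes, total_budget, max_epochs=4):
    """corpus_sizes: mapping language -> available tokens.
    Returns a mapping language -> tokens to sample."""
    allocation = {}
    remaining_budget = float(total_budget)
    # Visit the smallest corpora first so any budget they cannot absorb
    # (because of the epoch cap) is redistributed to larger languages.
    remaining = sorted(corpus_sizes, key=corpus_sizes.get)
    while remaining:
        lang = remaining.pop(0)
        fair_share = remaining_budget / (len(remaining) + 1)
        cap = max_epochs * corpus_sizes[lang]
        allocation[lang] = min(fair_share, cap)
        remaining_budget -= allocation[lang]
    return allocation

# Example: the tiny corpus is capped at 4 epochs; its unused share flows to the rest.
print(capped_uniform_allocation({"amharic": 1e6, "swahili": 5e7, "english": 1e10},
                                total_budget=3e8))

In this toy example the smallest corpus keeps only four passes over its data, and the budget it cannot use is redistributed to the larger languages, which is the fairness-versus-overfitting trade-off the UniMax summary refers to.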