LANS: Large-scale Arabic News Summarization Corpus
- URL: http://arxiv.org/abs/2210.13600v1
- Date: Mon, 24 Oct 2022 20:54:01 GMT
- Title: LANS: Large-scale Arabic News Summarization Corpus
- Authors: Abdulaziz Alhamadani, Xuchao Zhang, Jianfeng He, Chang-Tien Lu
- Abstract summary: We build LANS, a large-scale and diverse dataset for the Arabic text summarization task.
LANS offers 8.4 million articles and their summaries, extracted from newspaper websites' metadata between 1999 and 2019.
- Score: 20.835296945483275
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Text summarization has been intensively studied in many languages, and some
languages have reached advanced stages. Yet, Arabic Text Summarization (ATS) is
still in its developing stages. Existing ATS datasets are either small or lack
diversity. We build LANS, a large-scale and diverse dataset for the Arabic Text
Summarization task. LANS offers 8.4 million articles and their summaries,
extracted from newspaper websites' metadata between 1999 and 2019. The
high-quality and diverse summaries are written by journalists from 22 major
Arab newspapers, and include an eclectic mix of at least 7 topics
from each source. We conduct an intrinsic evaluation of LANS using both automatic
and human evaluation. Human evaluation of 1000 random samples reports 95.4%
accuracy for our collected summaries, and automatic evaluation quantifies the
diversity and abstractness of the summaries. The dataset is publicly available
upon request.
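The abstract does not say which automatic metric is used to quantify abstractness. One common choice in summarization work is the share of summary n-grams that never appear in the source article. The following is a minimal sketch of that idea, not the authors' method: the function names are hypothetical, whitespace tokenization is an assumption (Arabic text would normally need a proper tokenizer), and the toy strings are made up for illustration.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a list of tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(article, summary, n=1):
    """Fraction of distinct summary n-grams that do not occur in the article.

    Assumption: whitespace tokenization, used here only to keep the sketch short.
    """
    article_ngrams = ngrams(article.split(), n)
    summary_ngrams = ngrams(summary.split(), n)
    if not summary_ngrams:
        # A summary shorter than n tokens has no n-grams to compare.
        return 0.0
    return len(summary_ngrams - article_ngrams) / len(summary_ngrams)


# Hypothetical toy pair, not taken from LANS.
article = "the cat sat on the mat"
summary = "a cat sat down"
unigram_novelty = novel_ngram_ratio(article, summary, n=1)
bigram_novelty = novel_ngram_ratio(article, summary, n=2)
```

A higher ratio suggests a more abstractive summary, while a ratio near zero suggests the summary was largely copied from the article.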
Related papers
- Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles [136.84278943588652]
We propose a new task of summarizing diverse information encountered in multiple news articles encompassing the same event.
To facilitate this task, we outlined a data collection schema for identifying diverse information and curated a dataset named DiverseSumm.
The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference.
arXiv Detail & Related papers (2023-09-17T20:28:17Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Evaluation of Abstractive Summarisation Models with Machine Translation in Deliberative Processes [23.249742737907905]
This dataset reflects difficulties of combining multiple narratives, mostly of poor grammatical quality, in a single text.
We report an extensive evaluation of a wide range of abstractive summarisation models in combination with an off-the-shelf machine translation model.
We obtain promising results regarding the fluency, consistency and relevance of the summaries produced.
arXiv Detail & Related papers (2021-10-12T09:23:57Z)
- Does Summary Evaluation Survive Translation to Other Languages? [0.0]
We translate an existing English summarization dataset, SummEval, into four different languages.
We analyze the scores from the automatic evaluation metrics in translated languages, as well as their correlation with human annotations in the source language.
arXiv Detail & Related papers (2021-09-16T17:35:01Z)
- XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [7.8288425529553916]
We present XL-Sum, a comprehensive and diverse dataset of 1 million professionally annotated article-summary pairs from BBC.
The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available.
XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation.
arXiv Detail & Related papers (2021-06-25T18:00:24Z)
- Liputan6: A Large-scale Indonesian Dataset for Text Summarization [43.375797352517765]
We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs.
We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset.
arXiv Detail & Related papers (2020-11-02T02:01:12Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)