Liputan6: A Large-scale Indonesian Dataset for Text Summarization
- URL: http://arxiv.org/abs/2011.00679v1
- Date: Mon, 2 Nov 2020 02:01:12 GMT
- Title: Liputan6: A Large-scale Indonesian Dataset for Text Summarization
- Authors: Fajri Koto and Jey Han Lau and Timothy Baldwin
- Abstract summary: We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs.
We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset.
- Score: 43.375797352517765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce a large-scale Indonesian summarization dataset.
We harvest articles from Liputan6.com, an online news portal, and obtain
215,827 document-summary pairs. We leverage pre-trained language models to
develop benchmark extractive and abstractive summarization methods over the
dataset with multilingual and monolingual BERT-based models. We include a
thorough error analysis by examining machine-generated summaries that have low
ROUGE scores, and expose issues both with ROUGE itself and with the
extractive and abstractive summarization models.
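The error analysis described above starts by flagging machine-generated summaries whose ROUGE scores fall below a threshold and inspecting them manually. A minimal sketch of that filtering step, assuming the third-party rouge_score package and hypothetical example pairs (this is not the paper's own evaluation pipeline):

```python
# Minimal ROUGE-based filtering for error analysis.
# Assumes: pip install rouge-score (Google's rouge_score package).
from rouge_score import rouge_scorer

# use_stemmer=False because the built-in Porter stemmer targets English,
# not Indonesian.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=False)

# Hypothetical (reference, system) summary pairs.
pairs = [
    ("presiden meresmikan bandara baru di bali",
     "bandara baru di bali diresmikan presiden"),
    ("harga beras naik menjelang lebaran",
     "tim nasional menang dua gol tanpa balas"),
]

LOW_ROUGE = 0.3  # hypothetical threshold for "low-ROUGE" summaries
for reference, candidate in pairs:
    scores = scorer.score(reference, candidate)
    r1 = scores["rouge1"].fmeasure
    if r1 < LOW_ROUGE:
        # Summaries below the threshold go to manual error analysis.
        print(f"flag for inspection (ROUGE-1 F1 = {r1:.2f}): {candidate}")
```

Whether a 0.3 cut-off is appropriate is itself one of the questions such an error analysis raises, since low ROUGE can reflect a valid paraphrase as well as a genuinely bad summary.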
Related papers
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We present a systematically created, human-annotated dataset consisting of coherent summaries for five publicly available datasets, together with natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% ROUGE-L) in producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z) - From News to Summaries: Building a Hungarian Corpus for Extractive and Abstractive Summarization [0.19107347888374507]
HunSum-2 is an open-source Hungarian corpus suitable for training abstractive and extractive summarization models.
The dataset is assembled from segments of the Common Crawl corpus, which undergo thorough cleaning.
arXiv Detail & Related papers (2024-04-04T16:07:06Z) - Abstractive Text Summarization Using the BRIO Training Paradigm [2.102846336724103]
This paper presents a technique to improve abstractive summaries by fine-tuning pre-trained language models.
We build a text summarization dataset for Vietnamese, called VieSum.
We perform experiments with abstractive summarization models trained with the BRIO paradigm on the CNNDM and the VieSum datasets.
arXiv Detail & Related papers (2023-05-23T05:09:53Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Evaluation of Abstractive Summarisation Models with Machine Translation
- Evaluation of Abstractive Summarisation Models with Machine Translation in Deliberative Processes [23.249742737907905]
The dataset reflects the difficulty of combining multiple narratives, mostly of poor grammatical quality, into a single text.
We report an extensive evaluation of a wide range of abstractive summarisation models in combination with an off-the-shelf machine translation model.
We obtain promising results regarding the fluency, consistency and relevance of the summaries produced.
arXiv Detail & Related papers (2021-10-12T09:23:57Z) - Bengali Abstractive News Summarization(BANS): A Neural Attention
Approach [0.8793721044482612]
We present a seq2seq based Long Short-Term Memory (LSTM) network model with attention at encoder-decoder.
Our proposed system deploys a local attention-based model that produces long word sequences forming lucid, human-like sentences.
We also prepared a dataset of more than 19k articles and corresponding human-written summaries collected from bangla.bdnews24.com.
arXiv Detail & Related papers (2020-12-03T08:17:31Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves improvements of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 points over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
arXiv Detail & Related papers (2020-10-06T02:51:02Z) - Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.