CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive
Summaries
- URL: http://arxiv.org/abs/2206.04253v1
- Date: Thu, 9 Jun 2022 03:53:52 GMT
- Title: CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive
Summaries
- Authors: Xiaojun Liu, Shunan Zang, Chuang Zhang, Xiaojun Chen, Yangyang Ding
- Abstract summary: The lack of creative ability in abstractive methods is a particular problem in automatic text summarization.
We propose the first Chinese Long Text Summarization dataset with a high level of abstractiveness, CLTS+.
We analyze the extraction strategies used in CLTS+ summaries against other datasets to quantify the abstractiveness and difficulty of our new data.
- Score: 10.113673549224256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The lack of creative ability in abstractive methods is a particular
problem in automatic text summarization: the summaries generated by models are
mostly extracted verbatim from the source articles. One of the main causes of
this problem is the scarcity of datasets with high abstractiveness, especially
for Chinese. To solve this problem, we paraphrase the reference summaries in
CLTS, the Chinese Long Text Summarization dataset, correct factual
inconsistencies, and propose CLTS+, the first Chinese long text summarization
dataset with a high level of abstractiveness, which contains more than 180K
article-summary pairs and is available online. Additionally, we introduce an
intrinsic metric based on co-occurrence words to evaluate the dataset we
constructed. We analyze the extraction strategies used in CLTS+ summaries
against other datasets to quantify the abstractiveness and difficulty of our
new data, and we train several baselines on CLTS+ to verify its utility for
improving the creative ability of models.
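The abstract does not spell out the co-occurrence metric, so the following is
only a minimal sketch of one plausible proxy for abstractiveness: the fraction
of summary n-grams that never co-occur in the source article. The function
names and toy data are illustrative assumptions, not the paper's definition.

```python
# Sketch of a co-occurrence-based abstractiveness proxy (an assumption:
# the CLTS+ metric is related, but its exact form is defined in the paper).
# For Chinese, tokens would come from a segmenter such as jieba; here we
# accept pre-tokenized lists to stay self-contained.

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_ratio(source_tokens, summary_tokens, n=1):
    """Fraction of summary n-grams that never occur in the source.

    Higher values mean the summary copies less, i.e. is more abstractive.
    """
    summary_ngrams = ngrams(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    source_ngrams = set(ngrams(source_tokens, n))
    novel = sum(1 for g in summary_ngrams if g not in source_ngrams)
    return novel / len(summary_ngrams)

# Toy example (hypothetical data): a summary that reuses half its words.
source = "the model extracts sentences from the article".split()
summary = "the summary rewrites the article content".split()
print(novel_ngram_ratio(source, summary, n=1))  # word-level novelty
print(novel_ngram_ratio(source, summary, n=2))  # bigram-level novelty
```

Comparing such ratios across CLTS+, the original CLTS, and other corpora is
one way to quantify the abstractiveness gap the abstract describes.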
Related papers
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created, human-annotated dataset of coherent summaries for five publicly available datasets, along with natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z)
- On Context Utilization in Summarization with Large Language Models [83.84459732796302]
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries.
Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens.
We conduct the first comprehensive study on context utilization and position bias in summarization.
arXiv Detail & Related papers (2023-10-16T16:45:12Z)
- Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method [35.181659789684545]
Automatic summarization generates concise summaries that contain key ideas of source documents.
References from CNN/DailyMail and BBC XSum are noisy, mainly in terms of factual hallucination and information redundancy.
We propose a Summary Chain-of-Thought (SumCoT) technique to elicit LLMs to generate summaries step by step.
Experimental results show our method outperforms state-of-the-art fine-tuned PLMs and zero-shot LLMs by +4.33/+4.77 in ROUGE-L.
arXiv Detail & Related papers (2023-05-22T18:54:35Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Generating Multiple-Length Summaries via Reinforcement Learning for Unsupervised Sentence Summarization [44.835811239393244]
Sentence summarization shortens a given text while preserving its core content.
Unsupervised approaches have been studied to summarize texts without human-written summaries.
We devise an abstractive model based on reinforcement learning without ground-truth summaries.
arXiv Detail & Related papers (2022-12-21T08:34:28Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- MACSum: Controllable Summarization with Mixed Attributes [56.685735509260276]
MACSum is the first human-annotated summarization dataset for controlling mixed attributes.
We propose two simple and effective parameter-efficient approaches for the new task of mixed controllable summarization.
arXiv Detail & Related papers (2022-11-09T17:17:37Z)
- CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level [15.969302324314516]
We present a large-scale Chinese news summarization dataset CNewSum.
It consists of 304,307 documents and human-written summaries for the news feed.
Its test set contains adequacy and deducibility annotations for the summaries.
arXiv Detail & Related papers (2021-10-21T03:37:46Z)
- Topic Modeling Based Extractive Text Summarization [0.0]
We propose a novel method to summarize a text document by clustering its contents based on latent topics.
We utilize the lesser-used and challenging WikiHow dataset in our approach to text summarization.
arXiv Detail & Related papers (2021-06-29T12:28:19Z)
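The entry above describes clustering document contents by latent topics; its
exact pipeline is not given here, so the following is a minimal sketch of one
such approach: fit LDA over sentences and extract the highest-weight sentence
per topic. All parameters and the toy document are illustrative assumptions.

```python
# Hedged sketch of topic-modeling-based extractive summarization
# (an assumed pipeline; the cited paper's actual method may differ).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_extractive_summary(sentences, n_topics=2):
    """Pick the sentence that best represents each latent topic."""
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    weights = lda.fit_transform(counts)  # sentence-topic distribution
    chosen = {weights[:, t].argmax() for t in range(n_topics)}
    return [sentences[i] for i in sorted(chosen)]  # keep original order

# Toy document (hypothetical); real inputs would be WikiHow articles.
doc = [
    "Whisk the eggs with milk and a pinch of salt.",
    "Pour the mixture into a hot buttered pan.",
    "Tighten the bolts on the bicycle frame.",
    "Check the tire pressure before every ride.",
]
print(topic_extractive_summary(doc, n_topics=2))
```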
- Liputan6: A Large-scale Indonesian Dataset for Text Summarization [43.375797352517765]
We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs.
We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset.
arXiv Detail & Related papers (2020-11-02T02:01:12Z)
- Multi-Fact Correction in Abstractive Text Summarization [98.27031108197944]
Span-Fact is a suite of two factual correction models that leverages knowledge learned from question answering models to make corrections in system-generated summaries via span selection.
Our models employ single or multi-masking strategies to either iteratively or auto-regressively replace entities in order to ensure semantic consistency w.r.t. the source text.
Experiments show that our models significantly boost the factual consistency of system-generated summaries without sacrificing summary quality in terms of both automatic metrics and human evaluation.
arXiv Detail & Related papers (2020-10-06T02:51:02Z)
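Span-Fact's own models are not reproduced here; as a rough illustration of the
mask-then-refill idea the entry describes, the sketch below masks one entity in
a summary and lets a generic fill-mask model propose a replacement conditioned
on the source. The model choice, helper name, and toy texts are assumptions;
real systems would mask entities found by NER and iterate over all of them.

```python
# Hedged sketch of mask-and-refill entity correction, loosely inspired by
# the Span-Fact description above (not the authors' code or models).
from transformers import pipeline

# Any masked LM serves the sketch; Span-Fact itself uses QA-derived models.
fill = pipeline("fill-mask", model="distilroberta-base")

def correct_entity(source: str, summary: str, entity: str) -> str:
    """Mask one entity in the summary and refill it conditioned on the source."""
    masked = summary.replace(entity, fill.tokenizer.mask_token, 1)
    # Prepend the source so the model can copy the consistent entity from it.
    best = fill(f"{source} {masked}")[0]  # top-scoring candidate
    # Single-token refill only; real systems handle multi-token spans.
    return masked.replace(fill.tokenizer.mask_token, best["token_str"].strip())

source = "Marie Curie won the Nobel Prize in 1903."
summary = "Einstein won the Nobel Prize in 1903."  # factually inconsistent
print(correct_entity(source, summary, "Einstein"))
```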