ACLSum: A New Dataset for Aspect-based Summarization of Scientific
Publications
- URL: http://arxiv.org/abs/2403.05303v1
- Date: Fri, 8 Mar 2024 13:32:01 GMT
- Title: ACLSum: A New Dataset for Aspect-based Summarization of Scientific
Publications
- Authors: Sotaro Takeshita, Tommaso Green, Ines Reinig, Kai Eckert, Simone Paolo
Ponzetto
- Abstract summary: ACLSum is a novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization of scientific papers.
- Score: 10.529898520273063
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Extensive efforts in the past have been directed toward the development of
summarization datasets. However, a predominant number of these resources have
been (semi)-automatically generated, typically through web data crawling,
resulting in subpar resources for training and evaluating summarization
systems, a quality compromise that is arguably due to the substantial costs
associated with generating ground-truth summaries, particularly for diverse
languages and specialized domains. To address this issue, we present ACLSum, a
novel summarization dataset carefully crafted and evaluated by domain experts.
In contrast to previous datasets, ACLSum facilitates multi-aspect summarization
of scientific papers, covering challenges, approaches, and outcomes in depth.
Through extensive experiments, we evaluate the quality of our resource and the
performance of models based on pretrained language models and state-of-the-art
large language models (LLMs). Additionally, we explore the effectiveness of
extractive versus abstractive summarization within the scholarly domain on the
basis of automatically discovered aspects. Our results corroborate previous
findings in the general domain and indicate the general superiority of
end-to-end aspect-based summarization. Our data is released at
https://github.com/sobamchan/aclsum.
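To make the aspect-based structure described above concrete, the following minimal sketch models a single ACLSum-style record with one reference summary per aspect (challenge, approach, outcome) and runs a basic sanity check. The schema (field names such as document and summaries) and the toy record are illustrative assumptions, not the released format; see https://github.com/sobamchan/aclsum for the actual data layout.

```python
from dataclasses import dataclass
from typing import Dict, List

# The three aspects annotated in ACLSum, per the abstract.
ASPECTS = ("challenge", "approach", "outcome")


@dataclass
class AspectSummaryExample:
    """One paper paired with a reference summary per aspect (hypothetical schema)."""
    paper_id: str
    document: List[str]        # source sentences of the paper
    summaries: Dict[str, str]  # aspect name -> reference summary


def check_example(ex: AspectSummaryExample) -> None:
    """Verify that every aspect has a non-empty reference summary."""
    for aspect in ASPECTS:
        if not ex.summaries.get(aspect, "").strip():
            raise ValueError(f"{ex.paper_id}: missing summary for aspect '{aspect}'")


if __name__ == "__main__":
    # Toy record for illustration only; not taken from the real dataset.
    example = AspectSummaryExample(
        paper_id="toy-0001",
        document=[
            "Existing summarization corpora are noisy because they are crawled automatically.",
            "We build an expert-annotated corpus of ACL papers with aspect-level summaries.",
            "Models trained on it produce more faithful aspect-based summaries.",
        ],
        summaries={
            "challenge": "Automatically crawled summarization corpora are noisy.",
            "approach": "Experts annotate ACL papers with summaries for each aspect.",
            "outcome": "Trained models generate more faithful aspect-based summaries.",
        },
    )
    check_example(example)
    for aspect in ASPECTS:
        print(f"{aspect}: {example.summaries[aspect]}")
```

Keeping one reference summary per aspect in a single record mirrors the multi-aspect setting evaluated in the paper, where extractive and abstractive systems are compared per aspect.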
Related papers
- Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We present a systematically created, human-annotated dataset that pairs coherent summaries for five publicly available datasets with natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z)
- Beyond Finite Data: Towards Data-free Out-of-distribution Generalization via Extrapolation [19.944946262284123]
Humans can easily extrapolate to novel domains, so an intriguing question arises: how can neural networks extrapolate like humans and achieve out-of-distribution (OOD) generalization?
We introduce a novel approach to domain extrapolation that leverages reasoning ability and the extensive knowledge encapsulated within large language models (LLMs) to synthesize entirely new domains.
Our methods exhibit commendable performance in this setting, even surpassing the supervised setting by approximately 1-2% on datasets such as VLCS.
arXiv Detail & Related papers (2024-03-08T18:44:23Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but also unearths data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- All Data on the Table: Novel Dataset and Benchmark for Cross-Modality Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., the center points of entities are enriched with semantic information describing their attributes and relationships.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- On the State of German (Abstractive) Text Summarization [3.1776833268555134]
We assess the landscape of German abstractive text summarization.
We investigate why practically useful solutions for abstractive text summarization are still absent in industry.
arXiv Detail & Related papers (2023-01-17T18:59:20Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality.
We also create synthetic datasets which are more natural, resembling real world document-summary pairs.
Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effects of model architectures and generation strategies.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.