CiteSum: Citation Text-guided Scientific Extreme Summarization and
Low-resource Domain Adaptation
- URL: http://arxiv.org/abs/2205.06207v1
- Date: Thu, 12 May 2022 16:44:19 GMT
- Title: CiteSum: Citation Text-guided Scientific Extreme Summarization and
Low-resource Domain Adaptation
- Authors: Yuning Mao, Ming Zhong, Jiawei Han
- Abstract summary: We create a new benchmark CiteSum without human annotation, which is around 30 times larger than the previous human-curated dataset SciTLDR.
For scientific extreme summarization, CITES outperforms most fully-supervised methods on SciTLDR without any fine-tuning.
For news extreme summarization, CITES achieves significant gains on XSum over its base model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific extreme summarization (TLDR) aims to form ultra-short summaries of
scientific papers. Previous efforts on curating scientific TLDR datasets failed
to scale up due to the heavy human annotation and domain expertise required. In
this paper, we propose a simple yet effective approach to automatically
extracting TLDR summaries for scientific papers from their citation texts.
Based on the proposed approach, we create a new benchmark CiteSum without human
annotation, which is around 30 times larger than the previous human-curated
dataset SciTLDR. We conduct a comprehensive analysis of CiteSum, examining its
data characteristics and establishing strong baselines. We further demonstrate
the usefulness of CiteSum by adapting models pre-trained on CiteSum (named
CITES) to new tasks and domains with limited supervision. For scientific
extreme summarization, CITES outperforms most fully-supervised methods on
SciTLDR without any fine-tuning and obtains state-of-the-art results with only
128 examples. For news extreme summarization, CITES achieves significant gains
on XSum over its base model (not pre-trained on CiteSum), e.g., +7.2 ROUGE-1
zero-shot performance and state-of-the-art few-shot performance. For news
headline generation, CITES performs the best among unsupervised and zero-shot
methods on Gigaword.
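The extraction idea lends itself to a minimal sketch. Assuming citing sentences have already been aligned to the target paper and its inline citation replaced with a placeholder marker (both hypothetical preprocessing steps; this is not the authors' actual pipeline), a candidate TLDR can be chosen by brevity:

```python
def citation_tldr_candidates(citing_sentences, marker="[CITE]"):
    """Rank citing sentences as TLDR candidates (illustrative helper,
    not the paper's actual method)."""
    # Keep only sentences that reference the target paper via its marker.
    hits = [s for s in citing_sentences if marker in s]
    # Drop the marker and normalize whitespace so each reads as a summary.
    cleaned = [" ".join(s.replace(marker, "").split()) for s in hits]
    # Shortest first: an "extreme" summary should be ultra-short.
    return sorted(cleaned, key=len)


sentences = [
    "[CITE] build a large TLDR benchmark from citation texts without human annotation.",
    "Our setup differs from prior news summarization work.",
    "[CITE] show that citation texts make good ultra-short paper summaries.",
]
print(citation_tldr_candidates(sentences)[0])
```

In practice the paper's quality filters and alignment heuristics would replace the naive length ranking used here.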
Related papers
- CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era
Large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our framework significantly outperforms prior methods in both accuracy and interpretability.
arXiv Detail & Related papers (2026-02-26T19:17:39Z)
- Not too long do read: Evaluating LLM-generated extreme scientific summaries
We propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers. We then test popular open-weight LLMs for generating extreme summaries based on abstracts. Our analysis reveals that, although some of them successfully produce human-like summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures.
arXiv Detail & Related papers (2025-12-29T05:03:02Z)
- Attribution in Scientific Literature: New Benchmark and Methods
Large language models (LLMs) present a promising yet challenging frontier for automated source citation in scientific communication.
We introduce REASONS, a novel dataset with sentence-level annotations across 12 scientific domains from arXiv.
We conduct extensive experiments with models such as GPT-O1, GPT-4O, GPT-3.5, DeepSeek, and other smaller models like Perplexity AI (7B).
arXiv Detail & Related papers (2024-05-03T16:38:51Z)
- AugSumm: towards generalizable speech summarization using synthetic labels from large language model
Abstractive speech summarization (SSUM) aims to generate human-like summaries from speech.
However, conventional SSUM models are mostly trained and evaluated with a single ground-truth (GT) human-annotated deterministic summary.
We propose AugSumm, a method to leverage large language models (LLMs) as a proxy for human annotators to generate augmented summaries.
arXiv Detail & Related papers (2024-01-10T18:39:46Z)
- Scientific Paper Extractive Summarization Enhanced by Citation Graphs
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that citation graphs are helpful even in a simple unsupervised framework.
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available.
arXiv Detail & Related papers (2022-12-08T11:53:12Z)
- TSTR: Too Short to Represent, Summarize with Details! Intro-Guided Extended Summary Generation
In domains where the source text is relatively long-form, such as scientific documents, such a summary cannot go beyond a general and coarse overview.
In this paper, we propose TSTR, an extractive summarizer that utilizes the introductory information of documents as pointers to their salient information.
arXiv Detail & Related papers (2022-06-02T02:45:31Z)
- CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding
We present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source.
CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation.
arXiv Detail & Related papers (2021-05-23T11:08:45Z)
- Transductive Learning for Abstractive News Summarization
We propose the first application of transductive learning to summarization.
We show that our approach yields state-of-the-art results on CNN/DM and NYT datasets.
arXiv Detail & Related papers (2021-04-17T17:33:12Z)
- Enhancing Scientific Papers Summarization with Citation Graph
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- What's New? Summarizing Contributions in Scientific Literature
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
- Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks.
Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains.
We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z)
- TLDR: Extreme Summarization of Scientific Documents
SciTLDR is a dataset of 5.4K TLDRs over 3.2K papers.
We propose CATTS, a simple yet effective learning strategy for generating TLDRs.
Data and code are publicly available at https://www.allenai.com/scitldr.
arXiv Detail & Related papers (2020-04-30T17:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.