TLDR: Extreme Summarization of Scientific Documents
- URL: http://arxiv.org/abs/2004.15011v3
- Date: Thu, 8 Oct 2020 22:41:44 GMT
- Title: TLDR: Extreme Summarization of Scientific Documents
- Authors: Isabel Cachola, Kyle Lo, Arman Cohan, Daniel S. Weld
- Abstract summary: SciTLDR is a dataset of 5.4K TLDRs over 3.2K papers.
We propose CATTS, a simple yet effective learning strategy for generating TLDRs.
Data and code are publicly available at https://github.com/allenai/scitldr.
- Score: 38.11051158313414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce TLDR generation, a new form of extreme summarization, for
scientific papers. TLDR generation involves high source compression and
requires expert background knowledge and understanding of complex
domain-specific language. To facilitate study on this task, we introduce
SciTLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SciTLDR
contains both author-written and expert-derived TLDRs, where the latter are
collected using a novel annotation protocol that produces high-quality
summaries while minimizing annotation burden. We propose CATTS, a simple yet
effective learning strategy for generating TLDRs that exploits titles as an
auxiliary training signal. CATTS improves upon strong baselines under both
automated metrics and human evaluations. Data and code are publicly available
at https://github.com/allenai/scitldr.
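As a rough illustration of the title-as-auxiliary-signal idea behind CATTS, a single seq2seq model can be trained on interleaved title-generation and TLDR-generation pairs distinguished by a control token. This is a minimal sketch: the control-token names and the exact mixing scheme are illustrative assumptions, not the paper's precise recipe.

```python
def build_catts_style_examples(papers):
    """Interleave TLDR-generation and title-generation training pairs.

    Each paper is a dict with "body", "title", and "tldr" keys.
    A control token appended to the source tells the shared model
    which target to produce, so title generation acts as an
    auxiliary training signal for TLDR generation.
    """
    examples = []
    for p in papers:
        # Primary task: generate the TLDR from the paper body.
        examples.append({"source": p["body"] + " <|TLDR|>", "target": p["tldr"]})
        # Auxiliary task: generate the title from the same body.
        examples.append({"source": p["body"] + " <|TITLE|>", "target": p["title"]})
    return examples
```

The combined examples would then be fed to any standard sequence-to-sequence fine-tuning loop.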
Related papers
- Not too long do read: Evaluating LLM-generated extreme scientific summaries [0.0]
We propose a novel dataset, BiomedTLDR, containing a large sample of researcher-authored summaries from scientific papers.
We then test popular open-weight LLMs for generating extreme summaries based on abstracts.
Our analysis reveals that, although some of them successfully produce human-like summaries, LLMs generally exhibit a greater affinity for the original text's lexical choices and rhetorical structures.
arXiv Detail & Related papers (2025-12-29T05:03:02Z) - Hybrid Annotation for Propaganda Detection: Integrating LLM Pre-Annotations with Human Intelligence [8.856227991149506]
This paper introduces a novel framework that combines human expertise with Large Language Model (LLM) assistance to improve both annotation consistency and scalability.
We propose a hierarchical taxonomy that organizes 14 fine-grained propaganda techniques into three broader categories.
We implement an LLM-assisted pre-annotation pipeline that extracts propagandistic spans, generates concise explanations, and assigns local labels as well as a global label.
arXiv Detail & Related papers (2025-07-24T12:16:52Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments demonstrate on two datasets from different domains, that LLMs fine-tuned with the auxiliary task generate higher quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z) - TL;DR Progress: Multi-faceted Literature Exploration in Text Summarization [37.88261925867143]
This paper presents TL;DR Progress, a new tool for exploring the literature on neural text summarization.
It organizes 514 papers based on a comprehensive annotation scheme for text summarization approaches.
arXiv Detail & Related papers (2024-02-10T09:16:56Z) - Fusing Multimodal Signals on Hyper-complex Space for Extreme Abstractive Text Summarization (TL;DR) of Scientific Contents [26.32569293387399]
We deal with a novel task of extreme abstractive text summarization (aka TL;DR generation) by leveraging multiple input modalities.
The accompanying mTLDR dataset comprises 4,182 instances collected from various academic conference proceedings.
We present mTLDRgen, an encoder-decoder-based model that employs a novel dual-fused hyper-complex Transformer.
arXiv Detail & Related papers (2023-06-24T13:51:42Z) - Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z) - CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation [41.494287785760534]
We create a new benchmark CiteSum without human annotation, which is around 30 times larger than the previous human-curated dataset SciTLDR.
For scientific extreme summarization, CITES outperforms most fully-supervised methods on SciTLDR without any fine-tuning.
For news extreme summarization, CITES achieves significant gains on XSum over its base model.
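The annotation-free construction behind a citation-guided benchmark can be sketched roughly as follows: citing sentences that describe a paper serve as pseudo-summaries of it, with no human annotators in the loop. The function and field names below are illustrative assumptions, and the longest-sentence heuristic is a crude stand-in for CiteSum's actual selection and filtering criteria.

```python
def mine_citation_summaries(citations):
    """Group citing sentences by the paper they cite and pick one
    per paper as a pseudo-summary, avoiding human annotation.

    `citations` is a list of dicts with "cited_id" and "sentence"
    keys. Here the longest citing sentence per paper is chosen as a
    simple salience proxy.
    """
    by_paper = {}
    for c in citations:
        # Collect all sentences that cite the same paper.
        by_paper.setdefault(c["cited_id"], []).append(c["sentence"])
    # Keep the longest citing sentence for each cited paper.
    return {pid: max(sents, key=len) for pid, sents in by_paper.items()}
```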
arXiv Detail & Related papers (2022-05-12T16:44:19Z) - HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in Rouge F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z) - Word-level Human Interpretable Scoring Mechanism for Novel Text Detection Using Tsetlin Machines [16.457778420360537]
We propose a Tsetlin machine architecture for scoring individual words according to their contribution to novelty.
Our approach encodes a description of the novel documents using the linguistic patterns captured by TM clauses.
We then adopt this description to measure how much a word contributes to making documents novel.
arXiv Detail & Related papers (2021-05-10T23:41:14Z) - How to Train Your Agent to Read and Write [52.24605794920856]
Reading and writing research papers are among the essential skills a qualified researcher must master.
It would be fascinating if we could train an intelligent agent to help people read and summarize papers, and perhaps even discover and exploit the potential knowledge clues to write novel papers.
We propose a Deep ReAder-Writer (DRAW) network, which consists of a Reader that can extract knowledge graphs (KGs) from input paragraphs and discover potential knowledge, a graph-to-text Writer that generates a novel paragraph, and a …
arXiv Detail & Related papers (2021-01-04T12:22:04Z) - What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.