CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding
- URL: http://arxiv.org/abs/2105.10912v2
- Date: Tue, 25 May 2021 09:20:30 GMT
- Title: CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding
- Authors: Dustin Wright and Isabelle Augenstein
- Abstract summary: We present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source.
CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation.
- Score: 23.930041685595775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scientific document understanding is challenging as the data is highly domain
specific and diverse. However, datasets for tasks with scientific text require
expensive manual annotation and tend to be small and limited to only one or a
few fields. At the same time, scientific documents contain many potential
training signals, such as citations, which can be used to build large labelled
datasets. Given this, we present an in-depth study of cite-worthiness detection
in English, where a sentence is labelled for whether or not it cites an
external source. To accomplish this, we introduce CiteWorth, a large,
contextualized, rigorously cleaned labelled dataset for cite-worthiness
detection built from a massive corpus of extracted plain-text scientific
documents. We show that CiteWorth is high-quality, challenging, and suitable
for studying problems such as domain adaptation. Our best performing
cite-worthiness detection model is a paragraph-level contextualized sentence
labelling model based on Longformer, exhibiting a 5 F1 point improvement over
SciBERT, which considers only individual sentences. Finally, we demonstrate that
language model fine-tuning with cite-worthiness as a secondary task leads to
improved performance on downstream scientific document understanding tasks.
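As a concrete illustration, the sentence-level baseline the paper compares against can be framed as binary sequence classification. The sketch below is a minimal, hypothetical example using the Hugging Face transformers API with the public SciBERT checkpoint; it is not the authors' released code, and the example sentences are invented.
```python
# Minimal sketch: sentence-level cite-worthiness as binary classification.
# `allenai/scibert_scivocab_uncased` is the public SciBERT checkpoint;
# this is NOT the authors' released code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2  # {not cite-worthy, cite-worthy}
)

sentences = [
    "Prior work has shown that pretraining improves downstream accuracy.",
    "We describe our experimental setup in the next section.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits  # shape: (batch, 2)
probs = logits.softmax(dim=-1)[:, 1]  # probability a citation is warranted
for s, p in zip(sentences, probs):
    print(f"{p:.2f}  {s}")
```
The classification head here is randomly initialised and would need fine-tuning on cite-worthiness labels before the scores mean anything. The paper's best model instead labels all sentences of a paragraph jointly with Longformer, which is what yields the reported 5 F1 point gain over this per-sentence setup.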
Related papers
- Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models [0.0]
We propose a Bidirectional Long Short-Term Memory (BiLSTM) network with attention mechanism and contextual information to detect sentences that need citations.
We produce a new, large dataset (PMOA-CITE) based on the PubMed Open Access Subset, which is orders of magnitude larger than previous datasets.
arXiv Detail & Related papers (2024-05-20T17:45:36Z)
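A classifier of the kind this entry describes, a BiLSTM with additive attention over token states, can be sketched in a few lines of PyTorch. All dimensions and names below are illustrative assumptions, not the published architecture.
```python
# Illustrative sketch of a BiLSTM sentence classifier with additive attention,
# in the spirit of the model described above (hyperparameters are assumptions).
import torch
import torch.nn as nn

class BiLSTMAttentionClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # scores each token's hidden state
        self.out = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))        # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over time steps
        context = (weights * h).sum(dim=1)             # weighted sentence vector
        return self.out(context)

model = BiLSTMAttentionClassifier(vocab_size=30000)
logits = model(torch.randint(0, 30000, (4, 25)))  # 4 dummy sentences of 25 tokens
print(logits.shape)  # torch.Size([4, 2])
```
The attention weights double as an interpretability signal: they indicate which tokens drove the cite-worthiness decision.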
- Context-Enhanced Language Models for Generating Multi-Paper Citations [35.80247519023821]
We propose a method that leverages Large Language Models (LLMs) to generate multi-citation sentences.
Our approach takes a single source paper and a collection of target papers and produces a coherent paragraph of multi-sentence citation text.
arXiv Detail & Related papers (2024-04-22T04:30:36Z)
- ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science [0.0]
Large language models achieve impressive performance on many natural language processing tasks.
Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources.
We propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation.
arXiv Detail & Related papers (2023-11-21T02:02:46Z)
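To make "retrieval augmentation" concrete, the sketch below shows the generic pattern: rank external passages against a query and prepend the top hits to the model input. It uses plain TF-IDF retrieval for brevity and is only a schematic stand-in for ATLANTIC's structure-aware method, which additionally conditions on document structure; all passages are invented.
```python
# Generic retrieval-augmentation pattern (NOT ATLANTIC's actual method):
# rank external passages against the query, then prepend the top hits.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "Perovskite solar cells degrade quickly under humidity.",
    "Transformer models rely on self-attention over token sequences.",
    "Encapsulation layers improve the stability of perovskite devices.",
]
query = "How can perovskite solar cell stability be improved?"

vectorizer = TfidfVectorizer().fit(passages + [query])
scores = cosine_similarity(vectorizer.transform([query]),
                           vectorizer.transform(passages))[0]
top_k = sorted(range(len(passages)), key=lambda i: -scores[i])[:2]

# The augmented prompt a language model would then consume:
prompt = "\n".join(passages[i] for i in top_k) + "\n\nQuestion: " + query
print(prompt)
```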
- CiteCaseLAW: Citation Worthiness Detection in Caselaw for Legal Assistive Writing [44.75251805925605]
We introduce a labeled dataset of 178M sentences for citation-worthiness detection in the legal domain, drawn from the Caselaw Access Project (CAP).
The performance of various deep learning models was examined on this novel dataset.
The domain-specific pre-trained model tends to outperform other models, with an 88% F1-score for the citation-worthiness detection task.
arXiv Detail & Related papers (2023-05-03T04:20:56Z)
- CiteBench: A Benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench unifies multiple diverse datasets into a single benchmark for citation text generation, enabling standardized evaluation across task designs and domains.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- Scientific Paper Extractive Summarization Enhanced by Citation Graphs [50.19266650000948]
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework.
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available.
arXiv Detail & Related papers (2022-12-08T11:53:12Z)
- Towards generating citation sentences for multiple references with intent control [86.53829532976303]
We build a novel generation model with the Fusion-in-Decoder approach to cope with multiple long inputs.
Experiments demonstrate that the proposed approaches provide much more comprehensive features for generating citation sentences.
arXiv Detail & Related papers (2021-12-02T15:32:24Z)
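The Fusion-in-Decoder pattern named in the entry above is simple to sketch with a stock encoder-decoder model: encode each reference separately, then let the decoder cross-attend over the concatenation of all encoder outputs. The snippet below is an assumed illustration with t5-small, not the paper's implementation, and the inputs are invented.
```python
# Sketch of the Fusion-in-Decoder idea (assumed illustration, not the
# paper's code): encode each input separately, fuse in the decoder.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Two (invented) reference abstracts to be cited jointly.
inputs = [
    "cite: Paper A studies citation generation with neural models.",
    "cite: Paper B proposes intent labels for citation sentences.",
]
enc = tok(inputs, padding=True, return_tensors="pt")

with torch.no_grad():
    # Encode each input independently with the shared encoder...
    enc_out = model.encoder(input_ids=enc.input_ids,
                            attention_mask=enc.attention_mask)
    # ...then concatenate along the sequence axis so the decoder
    # cross-attends over all inputs at once ("fusion in decoder").
    fused = enc_out.last_hidden_state.reshape(1, -1, model.config.d_model)
    mask = enc.attention_mask.reshape(1, -1)
    out = model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                         attention_mask=mask, max_new_tokens=32)

print(tok.decode(out[0], skip_special_tokens=True))
```
Untrained for this task, the generated text is meaningless; the point is the shape manipulation, where N separate encoder passes become one long sequence that the decoder attends over jointly, which is what lets the model cope with multiple long inputs.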
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific paper summarization by utilizing the citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model achieves competitive performance compared with pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents by pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
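SPECTER's checkpoint is public on the Hugging Face hub, so producing a paper embedding takes only a few lines. The snippet below follows the usage documented on the model card; the abstract string is truncated here for brevity.
```python
# Embed a paper with the public SPECTER checkpoint (allenai/specter).
# Per the model card, the input is "title [SEP] abstract" and the
# embedding is the final-layer [CLS] vector.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

title = "CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding"
abstract = "We present an in-depth study of cite-worthiness detection in English..."

inputs = tokenizer(title + tokenizer.sep_token + abstract,
                   padding=True, truncation=True, max_length=512,
                   return_tensors="pt")
with torch.no_grad():
    embedding = model(**inputs).last_hidden_state[:, 0, :]  # [CLS] vector
print(embedding.shape)  # torch.Size([1, 768])
```
Embeddings produced this way can be compared with cosine similarity for the citation prediction, classification, and recommendation tasks that SciDocs evaluates.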