SPECTER: Document-level Representation Learning using Citation-informed
Transformers
- URL: http://arxiv.org/abs/2004.07180v4
- Date: Wed, 20 May 2020 17:39:52 GMT
- Title: SPECTER: Document-level Representation Learning using Citation-informed
Transformers
- Authors: Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, Daniel S. Weld
- Abstract summary: SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
- Score: 51.048515757909215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representation learning is a critical ingredient for natural language
processing systems. Recent Transformer language models like BERT learn powerful
textual representations, but these models are targeted towards token- and
sentence-level training objectives and do not leverage information on
inter-document relatedness, which limits their document-level representation
power. For applications on scientific documents, such as classification and
recommendation, accurate document-level embeddings are essential. We
propose SPECTER, a new method to generate document-level embedding of
scientific documents based on pretraining a Transformer language model on a
powerful signal of document-level relatedness: the citation graph. Unlike
existing pretrained language models, SPECTER can be easily applied to
downstream applications without task-specific fine-tuning. Additionally, to
encourage further research on document-level models, we introduce SciDocs, a
new evaluation benchmark consisting of seven document-level tasks ranging from
citation prediction to document classification and recommendation. We show
that SPECTER outperforms a variety of competitive baselines on the benchmark.
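As a concrete illustration of the approach described above, the sketch below pairs a hedged version of the citation-triplet pretraining signal with the publicly documented way of obtaining SPECTER embeddings from the released allenai/specter checkpoint on the Hugging Face Hub (title and abstract joined by the separator token, [CLS] pooling). The loss function, its margin value, and the example paper records are illustrative assumptions, not a reproduction of the authors' training configuration.

```python
# A minimal sketch, not the authors' training code: an L2-based triplet margin
# loss as one plausible form of the citation-graph relatedness signal, plus
# off-the-shelf embedding extraction with the allenai/specter checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer


def citation_triplet_loss(query, positive, negative, margin=1.0):
    # Pull a query paper toward a paper it cites (positive) and push it away
    # from a non-cited paper (negative). The margin is an illustrative choice.
    d_pos = torch.norm(query - positive, dim=-1)
    d_neg = torch.norm(query - negative, dim=-1)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()


# Embedding extraction without task-specific fine-tuning: each document is
# represented by the [CLS] vector of its "title [SEP] abstract" encoding.
tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [  # hypothetical records with the title/abstract fields SPECTER expects
    {"title": "Example paper A", "abstract": "A short abstract about topic A."},
    {"title": "Example paper B", "abstract": "A short abstract about topic B."},
]
title_abs = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(title_abs, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state[:, 0, :]  # one vector per document
```

Because the relatedness signal is baked in at pretraining time, these vectors can be used directly, e.g. with a nearest-neighbor search or a lightweight classifier, without fine-tuning the Transformer itself.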
Related papers
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science [0.0]
Large language models record impressive performance on many natural language processing tasks.
Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources.
We propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation.
arXiv Detail & Related papers (2023-11-21T02:02:46Z)
- Probing Representations for Document-level Event Extraction [30.523959637364484]
This work is the first to apply the probing paradigm to representations learned for document-level information extraction.
We designed eight embedding probes to analyze surface, semantic, and event-understanding capabilities relevant to document-level event extraction.
We found that trained encoders from these models yield embeddings that can modestly improve argument detection and labeling but only slightly enhance event-level tasks.
arXiv Detail & Related papers (2023-10-23T19:33:04Z)
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations [52.01865318382197]
We introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations.
We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats.
A new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.
arXiv Detail & Related papers (2022-11-23T21:25:39Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Contrastive Document Representation Learning with Graph Attention Networks [18.22722084624321]
We propose to use a graph attention network on top of available pretrained Transformer models to learn document embeddings.
In addition, based on our graph document model, we design a simple contrastive learning strategy to pretrain our models on a large unlabeled corpus.
arXiv Detail & Related papers (2021-10-20T21:05:02Z)
- A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data [5.123298347655086]
This work introduces a long-text-specific model -- the Hierarchical BERT Model (HBM) -- that learns sentence-level features of the text and works well in scenarios with limited data.
Various evaluation experiments have demonstrated that HBM can achieve higher performance in document classification than the previous state-of-the-art methods with only 50 to 200 labelled instances.
arXiv Detail & Related papers (2021-06-12T10:45:24Z)
- LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
- Rethinking Document-level Neural Machine Translation [73.42052953710605]
We try to answer the question: Is the capacity of current models strong enough for document-level translation?
We observe that the original Transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words.
arXiv Detail & Related papers (2020-10-18T11:18:29Z)
- Document-level Neural Machine Translation with Document Embeddings [82.4684444847092]
This work focuses on exploiting detailed document-level context in terms of multiple forms of document embeddings.
The proposed document-aware NMT is implemented to enhance the Transformer baseline by introducing both global and local document-level clues on the source end.
arXiv Detail & Related papers (2020-09-16T19:43:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.