LinkBERT: Pretraining Language Models with Document Links
- URL: http://arxiv.org/abs/2203.15827v1
- Date: Tue, 29 Mar 2022 18:01:24 GMT
- Title: LinkBERT: Pretraining Language Models with Document Links
- Authors: Michihiro Yasunaga, Jure Leskovec, Percy Liang
- Abstract summary: Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks.
We propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks.
We show that LinkBERT outperforms BERT on various downstream tasks across two domains.
- Score: 151.61148592954768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language model (LM) pretraining can learn various knowledge from text
corpora, helping downstream tasks. However, existing methods such as BERT model
a single document, and do not capture dependencies or knowledge that span
across documents. In this work, we propose LinkBERT, an LM pretraining method
that leverages links between documents, e.g., hyperlinks. Given a text corpus,
we view it as a graph of documents and create LM inputs by placing linked
documents in the same context. We then pretrain the LM with two joint
self-supervised objectives: masked language modeling and our new proposal,
document relation prediction. We show that LinkBERT outperforms BERT on various
downstream tasks across two domains: the general domain (pretrained on
Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with
citation links). LinkBERT is especially effective for multi-hop reasoning and
few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our
biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on
BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT,
as well as code and data at https://github.com/michiyasunaga/LinkBERT.
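To make the input construction and the two joint objectives concrete, here is a minimal sketch in PyTorch, assuming a Hugging Face BERT encoder. The pairing strategy (contiguous / random / linked segments) and the 3-way document relation prediction (DRP) head follow the abstract's description, but the helper names, sampling scheme, and simplified MLM head are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of LinkBERT-style pretraining (illustrative, not the official code).
# Assumes a document graph where `links[doc_id]` lists documents hyperlinked from doc_id
# and `docs[doc_id]` is a list of text segments of that document (at least two).
import random
import torch
import torch.nn as nn
from transformers import BertModel

def build_pretraining_pair(doc_id, docs, links):
    """Pick segment B as contiguous (same doc), random, or linked, with a 3-way DRP label."""
    seg_a = docs[doc_id][0]                              # anchor segment
    choice = random.choice(["contiguous", "random", "linked"])
    if choice == "contiguous":
        seg_b = docs[doc_id][1]                          # next segment of the same document
    elif choice == "random":
        seg_b = docs[random.choice(list(docs))][0]       # segment from a random document
    else:
        seg_b = docs[random.choice(links[doc_id])][0]    # segment from a hyperlinked document
    label = {"contiguous": 0, "random": 1, "linked": 2}[choice]
    return seg_a, seg_b, label

class LinkBertPretrainingSketch(nn.Module):
    """BERT encoder with joint MLM + document relation prediction (DRP) heads."""
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.mlm_head = nn.Linear(hidden, self.bert.config.vocab_size)  # simplified MLM head
        self.drp_head = nn.Linear(hidden, 3)                            # contiguous / random / linked

    def forward(self, input_ids, attention_mask, token_type_ids, mlm_labels, drp_labels):
        out = self.bert(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        mlm_logits = self.mlm_head(out.last_hidden_state)
        drp_logits = self.drp_head(out.last_hidden_state[:, 0])         # [CLS] representation
        mlm_loss = nn.functional.cross_entropy(
            mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1), ignore_index=-100)
        drp_loss = nn.functional.cross_entropy(drp_logits, drp_labels)
        return mlm_loss + drp_loss                                      # joint self-supervised objective
```

In this sketch the masked-token labels would come from standard MLM masking of the concatenated segment pair, and the two losses are simply summed as the joint objective.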
Related papers
- RAFT: Adapting Language Model to Domain Specific RAG [75.63623523051491]
We present Retrieval Augmented FineTuning (RAFT), a training recipe that improves the model's ability to answer questions in "open-book" in-domain settings.
RAFT accomplishes this by citing verbatim the right sequence from the relevant document that would help answer the question.
RAFT consistently improves the model's performance across PubMed, HotpotQA, and Gorilla datasets.
arXiv Detail & Related papers (2024-03-15T09:26:02Z)
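As a rough illustration of the RAFT recipe summarized above, the sketch below formats one fine-tuning example that mixes the relevant (oracle) document with distractor documents and targets an answer quoting a verbatim span from it. The field names, distractor count, and citation format are assumptions for the sketch, not the paper's exact setup.

```python
# Illustrative RAFT-style training-example builder (field names, distractor count,
# and citation format are assumptions, not the paper's exact data format).
import random

def build_raft_example(question, oracle_doc, corpus, quote, final_answer, num_distractors=4):
    """Mix the oracle document with distractors; the target cites the supporting span verbatim."""
    distractors = random.sample([d for d in corpus if d != oracle_doc], num_distractors)
    context_docs = distractors + [oracle_doc]
    random.shuffle(context_docs)                     # the model must locate the relevant document
    prompt = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(context_docs))
    prompt += f"\n\nQuestion: {question}\nAnswer:"
    target = f'The relevant document states: "{quote}". Therefore, the answer is {final_answer}.'
    return {"prompt": prompt, "target": target}
```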
- NewsQs: Multi-Source Question Generation for the Inquiring Mind [59.79288644158271]
We present NewsQs, a dataset that provides question-answer pairs for multiple news documents.
To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles.
arXiv Detail & Related papers (2024-02-28T16:59:35Z)
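For context on how such questions can be generated automatically, below is a minimal sketch using the standard Hugging Face seq2seq interface. The checkpoint name and task prefix are hypothetical placeholders for a T5-Large model fine-tuned on FAQ-style articles as described above.

```python
# Sketch of question generation with a fine-tuned T5 model (checkpoint name is hypothetical).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "your-org/t5-large-faq-question-gen"   # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_questions(document, num_questions=3):
    """Generate candidate questions for a news document."""
    inputs = tokenizer("generate questions: " + document,
                       return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, num_beams=4, num_return_sequences=num_questions,
                             max_new_tokens=64)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```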
- Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings [3.944219308229571]
In Natural Language Processing (NLP), Machine Reading Comprehension (MRC) is the task of answering a question based on a given context.
Modern language models such as BioBERT, SciBERT and even ChatGPT are trained on vast amounts of in-domain medical corpora.
We propose a resource-efficient approach for injecting domain knowledge into a model without relying on such domain-specific pre-training.
arXiv Detail & Related papers (2024-01-15T21:43:46Z)
- Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z)
- Pre-training for Information Retrieval: Are Hyperlinks Fully Explored? [19.862211305690916]
We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training.
Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
arXiv Detail & Related papers (2022-09-14T12:03:31Z)
- Anchor Prediction: A Topic Modeling Approach [2.0411082897313984]
We propose an annotation task, which we refer to as anchor prediction.
Given a source document and a target document, this task consists of automatically identifying anchors in the source document.
We propose a contextualized relational topic model, CRTM, that models directed links between documents.
arXiv Detail & Related papers (2022-05-29T11:26:52Z)
- LP-BERT: Multi-task Pre-training Knowledge Graph BERT for Link Prediction [3.5382535469099436]
LP-BERT contains two training stages: multi-task pre-training and knowledge graph fine-tuning.
We achieve state-of-the-art results on the WN18RR and UMLS datasets, with the Hits@10 metric in particular improving by 5%.
arXiv Detail & Related papers (2022-01-13T09:18:30Z)
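For reference, Hits@10 (the metric cited for LP-BERT above) is the fraction of test queries whose correct entity is ranked within the top 10 candidates. A minimal sketch of the computation follows; the scoring inputs are stand-ins, and the filtered/unfiltered ranking protocol of the paper is not reproduced here.

```python
# Minimal Hits@k computation for link prediction (scores are a stand-in for a model's outputs).
import torch

def hits_at_k(scores: torch.Tensor, true_index: torch.Tensor, k: int = 10) -> float:
    """scores: (num_queries, num_entities) candidate scores; true_index: (num_queries,) gold ids."""
    topk = scores.topk(k, dim=-1).indices                    # top-k candidate entities per query
    hits = (topk == true_index.unsqueeze(-1)).any(dim=-1)    # does the gold entity appear in top k?
    return hits.float().mean().item()

# Example: 2 queries over 100 candidate entities (random scores, purely illustrative).
scores = torch.randn(2, 100)
print(hits_at_k(scores, torch.tensor([3, 42]), k=10))
```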
- TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing [64.87699383581885]
We introduce TextBrewer, an open-source knowledge distillation toolkit for natural language processing.
It supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling.
As a case study, we use TextBrewer to distill BERT on several typical NLP tasks.
arXiv Detail & Related papers (2020-02-28T09:44:07Z)
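The core of distilling BERT as in the TextBrewer case study above is matching the student's predictions to the teacher's softened outputs alongside the supervised task loss. The sketch below shows that generic distillation loss; it is not TextBrewer's actual API, and the temperature and weighting values are illustrative.

```python
# Generic knowledge-distillation loss (illustrative; not the TextBrewer API).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Blend soft-target KL distillation with the hard-label task loss."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)     # supervised task loss (e.g., classification)
    return alpha * kd + (1 - alpha) * ce
```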
- DC-BERT: Decoupling Question and Document for Efficient Contextual Encoding [90.85913515409275]
Recent studies on open-domain question answering have achieved prominent performance improvement using pre-trained language models such as BERT.
We propose DC-BERT, a contextual encoding framework that has dual BERT models: an online BERT which encodes the question only once, and an offline BERT which pre-encodes all the documents and caches their encodings.
On SQuAD Open and Natural Questions Open datasets, DC-BERT achieves 10x speedup on document retrieval, while retaining most (about 98%) of the QA performance.
arXiv Detail & Related papers (2020-02-28T08:18:37Z)
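A minimal sketch of the decoupled design described above, assuming Hugging Face BERT encoders: documents are pre-encoded once offline and cached, the question is encoded once online, and a lightweight interaction layer scores each cached document against the question. The module names and the single interaction layer are simplifying assumptions, not the exact DC-BERT architecture.

```python
# Sketch of DC-BERT-style decoupled encoding with a document cache (illustrative).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
online_bert = AutoModel.from_pretrained("bert-base-uncased")    # encodes the question at query time
offline_bert = AutoModel.from_pretrained("bert-base-uncased")   # pre-encodes all documents

@torch.no_grad()
def precompute_document_cache(documents):
    """Offline: encode and cache every document's token representations."""
    cache = {}
    for doc_id, text in documents.items():
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
        cache[doc_id] = offline_bert(**enc).last_hidden_state    # (1, doc_len, hidden)
    return cache

class QuestionDocumentScorer(nn.Module):
    """Online: encode the question once, then apply a light interaction layer per document."""
    def __init__(self, hidden=768):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.interaction = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, question, cache):
        q_enc = tokenizer(question, return_tensors="pt", truncation=True, max_length=64)
        q_states = online_bert(**q_enc).last_hidden_state        # computed once per question
        scores = {}
        for doc_id, d_states in cache.items():
            joint = torch.cat([q_states, d_states], dim=1)       # concatenate token states
            pooled = self.interaction(joint)[:, 0]               # first token as a summary vector
            scores[doc_id] = self.classifier(pooled).item()
        return scores
```

Because the expensive document encoding happens offline, only the short question passes through BERT at query time, which is the source of the reported retrieval speedup.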
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of using it.