ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for
Interdisciplinary Science
- URL: http://arxiv.org/abs/2311.12289v1
- Date: Tue, 21 Nov 2023 02:02:46 GMT
- Title: ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for
Interdisciplinary Science
- Authors: Sai Munikoti, Anurag Acharya, Sridevi Wagle, Sameera Horawalavithana
- Abstract summary: Large language models achieve impressive performance on many natural language processing tasks.
Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources.
We propose a novel structure-aware retrieval-augmented language model that accommodates document structure during retrieval augmentation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models achieve impressive performance on many natural
language processing tasks. However, their knowledge capacity is limited to the
pretraining corpus. Retrieval augmentation offers an effective solution by
retrieving context from external knowledge sources to complement the language
model. However, existing retrieval augmentation techniques ignore the
structural relationships between the retrieved documents. Furthermore,
retrieval models have not been explored much for scientific tasks, especially
with regard to the faithfulness of retrieved documents. In this paper, we
propose a novel structure-aware retrieval-augmented language model that
accommodates document structure during retrieval augmentation. We create a
heterogeneous document graph capturing multiple types of relationships (e.g.,
citation, co-authorship) that connect documents from more than 15 scientific
disciplines (e.g., Physics, Medicine, Chemistry). We train a graph neural
network on the curated document graph to act as a structural encoder for the
passages retrieved during model pretraining. In particular, along with the text
embeddings of the retrieved passages, we obtain structural embeddings of their
source documents and fuse the two before feeding them to the language model. We
evaluate our model extensively on scientific benchmarks that include science
question answering and scientific document classification tasks. Experimental
results demonstrate that structure-aware retrieval yields more coherent,
faithful, and contextually relevant passages while maintaining comparable
overall accuracy.
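The abstract outlines the architecture: a graph neural network encodes each
retrieved passage's source document within the heterogeneous graph, and the
resulting structural embedding is fused with the passage's text embedding
before conditioning the language model. Below is a minimal sketch of that flow,
not the authors' released code: the relation set, the concatenate-then-project
fusion operator, and the class names RelationalGraphEncoder and
StructuralFusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationalGraphEncoder(nn.Module):
    """One round of relation-aware message passing over a heterogeneous
    document graph with edge types such as citation and co-authorship."""
    def __init__(self, dim: int, relations: list[str]):
        super().__init__()
        # One linear transform per relation type, R-GCN style (an assumption;
        # the paper does not specify the GNN variant here).
        self.rel_proj = nn.ModuleDict({r: nn.Linear(dim, dim) for r in relations})
        self.self_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: dict[str, torch.Tensor]) -> torch.Tensor:
        # x: (num_docs, dim) initial document features
        # adj[r]: (num_docs, num_docs) normalized adjacency for relation r
        out = self.self_proj(x)
        for r, a in adj.items():
            out = out + a @ self.rel_proj[r](x)  # aggregate neighbors per relation
        return torch.relu(out)

class StructuralFusion(nn.Module):
    """Fuse a passage's text embedding with its document's structural embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, text_emb: torch.Tensor, struct_emb: torch.Tensor) -> torch.Tensor:
        # Concatenate-then-project is one plausible fusion operator; the
        # abstract only says the two embeddings are fused.
        return self.fuse(torch.cat([text_emb, struct_emb], dim=-1))

# Toy usage: 4 documents, 2 relation types, 16-dim embeddings.
dim, n = 16, 4
relations = ["citation", "co_authorship"]
x = torch.randn(n, dim)
adj = {r: torch.eye(n) for r in relations}  # placeholder adjacency matrices
struct = RelationalGraphEncoder(dim, relations)(x, adj)
fused = StructuralFusion(dim)(torch.randn(n, dim), struct)
print(fused.shape)  # torch.Size([4, 16])
```

Concatenation followed by a linear projection is only one fusion choice;
attention-based or gated fusion would slot into StructuralFusion equally well.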
Related papers
- Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents.
We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm.
We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
arXiv Detail & Related papers (2024-09-11T17:21:59Z)
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research [139.69207791947738]
Dolma is a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials.
We document Dolma, including its design principles, details about its construction, and a summary of its contents.
We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices.
arXiv Detail & Related papers (2024-01-31T20:29:50Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection [5.847824494580938]
We propose a benchmark of articles paraphrased by recent Transformer-based language models.
Our contribution fosters future research of paraphrase detection systems as it offers a large collection of aligned original and paraphrased documents.
arXiv Detail & Related papers (2021-03-23T11:01:35Z)
- Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
arXiv Detail & Related papers (2020-10-15T02:35:31Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents by pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
- Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.