Related papers: Structured information extraction from complex scientific text with fine-tuned large language models

URL: http://arxiv.org/abs/2212.05238v1
Date: Sat, 10 Dec 2022 07:51:52 GMT
Title: Structured information extraction from complex scientific text with fine-tuned large language models
Authors: Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain
Abstract summary: We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 prompt-completion pairs. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
Score: 55.96705756327738
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Intelligently extracting and linking complex scientific information from unstructured text is a challenging endeavor particularly for those inexperienced with natural language processing. Here, we present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction for complex hierarchical information in scientific text. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs). Information is extracted either from single sentences or across sentences in abstracts/passages, and the output can be returned as simple English sentences or a more structured format, such as a list of JSON objects. We demonstrate that LLMs trained in this way are capable of accurately extracting useful records of complex scientific knowledge for three representative tasks in materials chemistry: linking dopants with their host materials, cataloging metal-organic frameworks, and general chemistry/phase/morphology/application information extraction. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text. An online demo is available at http://www.matscholar.com/info-extraction.
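As a rough illustration of the prompt/completion setup described in the abstract, the sketch below builds one training pair whose completion is a list of JSON objects, then parses such a completion back into records. It is a minimal sketch under stated assumptions, not the authors' actual format: the separator strings, the `host`/`dopants` keys, the helper names, and the example sentence are all hypothetical.

```python
import json

# Hypothetical markers for the end of a prompt and of a completion
# (assumptions for illustration, not taken from the paper).
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND"


def make_example(passage, records):
    """Build one prompt-completion pair; the completion is a JSON list of records."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        "completion": " " + json.dumps(records) + COMPLETION_END,
    }


def parse_completion(completion):
    """Recover the list of records from a completion; return [] if it is not a JSON list."""
    text = completion.split(COMPLETION_END)[0].strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []


# Made-up sentence and annotation for a dopant-host linking task.
example = make_example(
    "ZnO films doped with Al show improved conductivity.",
    [{"host": "ZnO", "dopants": ["Al"]}],
)

# One JSON object per line (JSONL) is a common layout for fine-tuning data.
jsonl_line = json.dumps(example)

records = parse_completion(example["completion"])
print(records)
```

Returning an empty list for unparseable output is one possible design choice; it lets downstream code skip malformed generations instead of failing on them.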

Related papers

What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models [4.8261605642238745]
Large language models (LLMs) fail to capture detailed relationships across large bodies of work. Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus. We prototype a system that answers precise questions about the literature as a whole.
arXiv Detail & Related papers (2025-03-12T23:24:40Z)
MatViX: Multimodal Information Extraction from Visually Rich Articles [6.349779979863784]
In materials science, extracting structured information from research articles can accelerate the discovery of new materials. We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured files. These files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for multimodal information extraction (MIE).
arXiv Detail & Related papers (2024-10-27T16:13:58Z)
Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm. We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
arXiv Detail & Related papers (2024-09-11T17:21:59Z)
FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions [4.961045761391367]
Reading comprehension models answer questions posed in natural language when provided with a short passage of text. We introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve the extraction quality. We demonstrate on two datasets that Relation Coherence boosts extraction performance and evaluate FabricQA-Extractor on large scale datasets.
arXiv Detail & Related papers (2024-08-17T15:16:54Z)
DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge. Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
Syntactic Complexity Identification, Measurement, and Reduction Through Controlled Syntactic Simplification [0.0]
We present a classical syntactic dependency-based approach to split and rephrase compound and complex sentences into sets of simplified sentences. The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity. This work was accepted and presented at the International Workshop on Learning with Knowledge Graphs (IWLKG) at the WSDM 2023 conference.
arXiv Detail & Related papers (2023-04-16T13:13:58Z)
ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles. Our proposed method ReSel decomposes this task into a two-stage procedure. Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
Explaining Patterns in Data with Language Models via Interpretable Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data. iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions. Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z)
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information [0.6443952406204634]
We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model. Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model outperforms a strong baseline collectively.
arXiv Detail & Related papers (2022-03-17T21:49:26Z)
ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge. We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text. We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.