Structured information extraction from complex scientific text with
fine-tuned large language models
- URL: http://arxiv.org/abs/2212.05238v1
- Date: Sat, 10 Dec 2022 07:51:52 GMT
- Title: Structured information extraction from complex scientific text with
fine-tuned large language models
- Authors: Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee, Andrew
S. Rosen, Gerbrand Ceder, Kristin Persson, Anubhav Jain
- Abstract summary: We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts (inputs) and completions (outputs).
This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
- Score: 55.96705756327738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intelligently extracting and linking complex scientific information from
unstructured text is a challenging endeavor, particularly for those
inexperienced with natural language processing. Here, we present a simple
sequence-to-sequence approach to joint named entity recognition and relation
extraction for complex hierarchical information in scientific text. The
approach leverages a pre-trained large language model (LLM), GPT-3, that is
fine-tuned on approximately 500 pairs of prompts (inputs) and completions
(outputs). Information is extracted either from single sentences or across
sentences in abstracts/passages, and the output can be returned as simple
English sentences or a more structured format, such as a list of JSON objects.
We demonstrate that LLMs trained in this way are capable of accurately
extracting useful records of complex scientific knowledge for three
representative tasks in materials chemistry: linking dopants with their host
materials, cataloging metal-organic frameworks, and general
chemistry/phase/morphology/application information extraction. This approach
represents a simple, accessible, and highly flexible route to obtaining large
databases of structured knowledge extracted from unstructured text. An online
demo is available at http://www.matscholar.com/info-extraction.
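The prompt/completion format described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the separator strings, the `host`/`dopants` schema, the helper names, and the example sentence are all hypothetical, chosen only to show how a passage could be paired with a JSON-list completion and how a returned completion could be parsed back into records.

```python
import json

# Hypothetical separators; the abstract does not specify any.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND"


def make_training_pair(passage, records):
    """Build one prompt/completion pair whose completion is a JSON list."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        "completion": " " + json.dumps(records) + COMPLETION_END,
    }


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


def parse_completion(text):
    """Parse a completion into a list of dicts; return [] if it is malformed."""
    body = text.partition(COMPLETION_END)[0].strip()
    try:
        parsed = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    return [record for record in parsed if isinstance(record, dict)]


# Made-up example sentence and annotation (dopant/host linking task).
passage = "ZnO thin films doped with Al and Ga were deposited by sputtering."
records = [{"host": "ZnO", "dopants": ["Al", "Ga"]}]

pair = make_training_pair(passage, records)
print(to_jsonl([pair]))
print(parse_completion(pair["completion"]))
```

Returning an empty list for malformed output is one possible design choice; a real pipeline might instead log or retry such completions.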
Related papers
- Learning to Solve Complex Problems via Dataset Decomposition [53.1641602054716]
This research explores a reverse curriculum generation approach that decomposes complex datasets into simpler, more learnable components.
We propose a teacher-student framework where the teacher is equipped with the ability to reason step-by-step, which is used to generate easier versions of examples.
arXiv Detail & Related papers (2026-02-23T19:25:40Z) - Exploring LLMs for Scientific Information Extraction Using The SciEx Framework [12.534492015126757]
Large language models (LLMs) are touted as powerful tools for automating scientific information extraction.
We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation.
We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently.
arXiv Detail & Related papers (2025-12-10T19:00:20Z) - OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas [57.49565459553627]
We introduce OmniStruct, a benchmark for assessing Large Language Models' capabilities on text-to-structure tasks.
We collect high-quality training data via synthetic task generation to facilitate the development of efficient text-to-structure models.
Our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models.
arXiv Detail & Related papers (2025-11-23T08:18:12Z) - LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature [60.879220305044726]
We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data.
We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes.
We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
arXiv Detail & Related papers (2025-10-28T17:58:18Z) - ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature [0.2447206672789868]
ComProScanner is an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of chemical compositions and properties.
We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models.
DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82.
arXiv Detail & Related papers (2025-10-23T09:01:44Z) - Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles [81.89404347890662]
SciTrek is a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles.
Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
arXiv Detail & Related papers (2025-09-25T11:36:09Z) - What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models [4.8261605642238745]
Large language models (LLMs) fail to capture detailed relationships across large bodies of work.
Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus.
We prototype a system that answers precise questions about the literature as a whole.
arXiv Detail & Related papers (2025-03-12T23:24:40Z) - MatViX: Multimodal Information Extraction from Visually Rich Articles [6.349779979863784]
In materials science, extracting structured information from research articles can accelerate the discovery of new materials.
We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured files.
These files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE.
arXiv Detail & Related papers (2024-10-27T16:13:58Z) - Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents.
We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm.
We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
arXiv Detail & Related papers (2024-09-11T17:21:59Z) - FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions [4.961045761391367]
Reading comprehension models answer questions posed in natural language when provided with a short passage of text.
We introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve the extraction quality.
We demonstrate on two datasets that Relation Coherence boosts extraction performance and evaluate FabricQA-Extractor on large scale datasets.
arXiv Detail & Related papers (2024-08-17T15:16:54Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z) - Syntactic Complexity Identification, Measurement, and Reduction Through
Controlled Syntactic Simplification [0.0]
We present a classical syntactic dependency-based approach to split and rephrase a compound and complex sentence into a set of simplified sentences.
The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity.
This work was accepted and presented at the International Workshop on Learning with Knowledge Graphs (IWLKG) at the WSDM-2023 conference.
arXiv Detail & Related papers (2023-04-16T13:13:58Z) - ReSel: N-ary Relation Extraction from Scientific Text and Tables by
Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z) - Explaining Patterns in Data with Language Models via Interpretable
Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding groundtruth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z) - HiStruct+: Improving Extractive Text Summarization with Hierarchical
Structure Information [0.6443952406204634]
We propose a novel approach to formulate, extract, encode and inject hierarchical structure information explicitly into an extractive summarization model.
Using various experimental settings on three datasets (i.e., CNN/DailyMail, PubMed and arXiv), our HiStruct+ model outperforms a strong baseline collectively.
arXiv Detail & Related papers (2022-03-17T21:49:26Z) - ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)