Related papers: OpenChemIE: An Information Extraction Toolkit For Chemistry Literature

URL: http://arxiv.org/abs/2404.01462v1
Date: Mon, 1 Apr 2024 20:16:21 GMT
Title: OpenChemIE: An Information Extraction Toolkit For Chemistry Literature
Authors: Vincent Fan, Yujie Qian, Alex Wang, Amber Wang, Connor W. Coley, Regina Barzilay
Abstract summary: OpenChemIE is a tool for extracting reaction data from chemistry literature. We employ specialized neural models that each address a specific task in chemistry information extraction. We meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole.
Score: 37.23189665773341
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.

Related papers

AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature [55.66036140125613]
This paper presents a large language model (LLM) agent named AgentCAT, which extracts and analyzes catalytic reaction data from chemical engineering papers. AgentCAT serves as an alternative to overcome the long-standing data bottleneck in the chemical engineering field.
arXiv Detail & Related papers (2026-02-10T04:30:11Z)
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning [51.393018266721576]
We propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image.
arXiv Detail & Related papers (2025-11-04T09:08:44Z)
A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature [8.306442315850878]
We develop a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction. Our system achieved an F1 score of 80.8% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature.
arXiv Detail & Related papers (2025-07-27T11:16:57Z)
ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
Interpretable Deep Learning for Polar Mechanistic Reaction Prediction [43.95903801494905]
We introduce PMechRP (Polar Mechanistic Reaction Predictor), a system that trains machine learning models on the PMechDB dataset. We compare a range of machine learning models, including transformer-based, graph-based and two-step Siamese architectures. Our best-performing approach was a hybrid model, which combines a 5-ensemble of Chemformer models with a two-step Siamese framework.
arXiv Detail & Related papers (2025-04-22T02:31:23Z)
Towards Large-scale Chemical Reaction Image Parsing via a Multimodal Large Language Model [4.860497022313892]
We introduce the Reaction Image Multimodal large language model (RxnIM) to parse chemical reaction images into machine-readable data. RxnIM extracts key chemical components from reaction images and interprets the textual content that describes reaction conditions. Our approach achieves excellent performance, with an average F1 score of 88% on various benchmarks, surpassing literature methods by 5%.
arXiv Detail & Related papers (2025-03-11T08:11:23Z)
ReactXT: Understanding Molecular "Reaction-ship" via Reaction-Contextualized Molecule-Text Pretraining [76.51346919370005]
We propose ReactXT for reaction-text modeling and OpenExp for experimental procedure prediction. ReactXT features three types of input contexts to incrementally pretrain LMs. Our code is available at https://github.com/syr-cn/ReactXT.
arXiv Detail & Related papers (2024-05-23T06:55:59Z)
EnzChemRED, a rich enzyme chemistry relation extraction dataset [3.6124226106001]
EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated. We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text.
arXiv Detail & Related papers (2024-04-22T14:18:34Z)
An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
Retrosynthesis prediction enhanced by in-silico reaction data augmentation [66.5643280109899]
We present RetroWISE, a framework that employs a base model inferred from real paired data to perform in-silico reaction generation and augmentation. On three benchmark datasets, RetroWISE achieves the best overall performance against state-of-the-art models.
arXiv Detail & Related papers (2024-01-31T07:40:37Z)
Predictive Chemistry Augmented with Text Retrieval [37.59545092901872]
We introduce TextReact, a novel method that directly augments predictive chemistry with texts retrieved from the literature. TextReact retrieves text descriptions relevant for a given chemical reaction, and then aligns them with the molecular representation of the reaction. We empirically validate the framework on two chemistry tasks: reaction condition recommendation and one-step retrosynthesis.
arXiv Detail & Related papers (2023-12-08T07:40:59Z)
ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision [27.850325653751078]
Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. We propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions.
arXiv Detail & Related papers (2023-07-04T02:52:30Z)
Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
PcMSP: A Dataset for Scientific Action Graphs Extraction from Polycrystalline Materials Synthesis Procedure Text [1.9573380763700712]
This dataset simultaneously contains the synthesis sentences extracted from the experimental paragraphs, as well as the entity mentions and intra-sentence relations. A two-step human annotation and inter-annotator agreement study guarantee the high quality of the PcMSP corpus. We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations.
arXiv Detail & Related papers (2022-10-22T09:43:54Z)
Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
Named entity recognition in chemical patents using ensemble of contextual language models [0.3731111830152912]
We study the effectiveness of contextualized language models to extract information from chemical patents. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%.
arXiv Detail & Related papers (2020-07-24T15:23:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.