Related papers: EnzChemRED, a rich enzyme chemistry relation extraction dataset

URL: http://arxiv.org/abs/2404.14209v1
Date: Mon, 22 Apr 2024 14:18:34 GMT
Title: EnzChemRED, a rich enzyme chemistry relation extraction dataset
Authors: Po-Ting Lai, Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Catherine Rivoire, Anastasia Sveshnikova, Chih-Hsuan Wei, Robert Leaman, Ling Luo, Zhiyong Lu, Alan Bridge
Abstract summary: EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated. We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text.
Score: 3.6124226106001
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated using identifiers from the UniProt Knowledgebase (UniProtKB) and the ontology of Chemical Entities of Biological Interest (ChEBI). We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text (Named Entity Recognition, or NER) and to extract the chemical conversions in which they participate (Relation Extraction, or RE), with average F1 score of 86.30% for NER, 86.66% for RE for chemical conversion pairs, and 83.79% for RE for chemical conversion pairs and linked enzymes. We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text and apply this to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea. The EnzChemRED corpus is freely available at https://ftp.expasy.org/databases/rhea/nlp/.

Related papers

AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature [55.66036140125613]
This paper presents a large language model (LLM) agent named AgentCAT, which extracts and analyzes catalytic reaction data from chemical engineering papers. AgentCAT serves as an alternative to overcome the long-standing data bottleneck in the chemical engineering field.
arXiv Detail & Related papers (2026-02-10T04:30:11Z)
Multimodal Regression for Enzyme Turnover Rates Prediction [57.60697333734054]
We propose a framework for predicting the enzyme turnover rate by integrating enzyme sequences, substrate structures, and environmental factors. Our model combines a pre-trained language model and a convolutional neural network to extract features from protein sequences. We leverage symbolic regression via Kolmogorov-Arnold Networks to explicitly learn mathematical formulas that govern the enzyme turnover rate.
arXiv Detail & Related papers (2025-09-15T11:07:26Z)
zERExtractor: An Automated Platform for Enzyme-Catalyzed Reaction Data Extraction from Scientific Literature [12.109637682144125]
zERExtractor is an automated platform for comprehensive extraction of enzyme-catalyzed reaction and activity data from scientific literature. Our pipeline combines domain-adapted deep learning, advanced OCR, semantic entity recognition, and prompt-driven LLM modules. We release a large benchmark dataset comprising over 1,000 annotated tables and 5,000 biological fields from 270 P450-related enzymology publications.
arXiv Detail & Related papers (2025-07-30T07:21:32Z)
ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
BioChemInsight: An Open-Source Toolkit for Automated Identification and Recognition of Optical Chemical Structures and Activity Data in Scientific Publications [25.764592266678132]
Existing optical chemical structure recognition tools fail to autonomously associate molecular structures with their bioactivity profiles. BioChemInsight is an open-source pipeline that integrates DECIMER and MolVec for chemical structure recognition, Qwen2.5-VL-32B for compound identifier association, and PaddleOCR for bioactivity extraction and unit normalization. The system generates ready-to-use SAR datasets, reducing data preprocessing time from weeks to hours.
arXiv Detail & Related papers (2025-04-12T04:56:44Z)
BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction [65.93303145891628]
BatGPT-Chem is a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Our model captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions. This development empowers chemists to adeptly address novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science.
arXiv Detail & Related papers (2024-08-19T05:17:40Z)
OpenChemIE: An Information Extraction Toolkit For Chemistry Literature [37.23189665773341]
OpenChemIE is a tool for extracting reaction data from chemistry literature. We employ specialized neural models that address a specific task for chemistry information extraction. We meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole.
arXiv Detail & Related papers (2024-04-01T20:16:21Z)
An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis [57.70772230913099]
Chemist-X automates the reaction condition recommendation (RCR) task in chemical synthesis with retrieval-augmented generation (RAG) technology. Chemist-X interrogates online molecular databases and distills critical data from the latest literature database. Chemist-X considerably reduces chemists' workload and allows them to focus on more fundamental and creative problems.
arXiv Detail & Related papers (2023-11-16T01:21:33Z)
ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision [27.850325653751078]
Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. We propose ReactIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions.
arXiv Detail & Related papers (2023-07-04T02:52:30Z)
End-to-End Models for Chemical-Protein Interaction Extraction: Better Tokenization and Span-Based Pipeline Strategies [1.782718930156674]
We employ a span-based pipeline approach to produce a new state-of-the-art E2ERE performance on the ChemProt dataset. Our results indicate that a straightforward fine-grained tokenization scheme helps span-based approaches excel in E2ERE.
arXiv Detail & Related papers (2023-04-03T20:20:22Z)
Tailoring Molecules for Protein Pockets: a Transformer-based Generative Solution for Structured-based Drug Design [133.1268990638971]
De novo drug design based on the structure of a target protein can provide novel drug candidates. We present a generative solution named TamGent that can directly generate candidate drugs from scratch for a given target.
arXiv Detail & Related papers (2022-08-30T09:32:39Z)
Improving Molecular Representation Learning with Metric Learning-enhanced Optimal Transport [49.237577649802034]
We develop a novel optimal transport-based algorithm termed MROT to enhance generalization capability for molecular regression problems. MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances.
arXiv Detail & Related papers (2022-02-13T04:56:18Z)
Chemical-Reaction-Aware Molecule Representation Learning [88.79052749877334]
We propose using chemical reactions to assist learning molecule representation. Our approach is proven effective to 1) keep the embedding space well-organized and 2) improve the generalization ability of molecule embeddings. Experimental results demonstrate that our method achieves state-of-the-art performance in a variety of downstream tasks.
arXiv Detail & Related papers (2021-09-21T00:08:43Z)
Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets. Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
Named entity recognition in chemical patents using ensemble of contextual language models [0.3731111830152912]
We study the effectiveness of contextualized language models to extract information from chemical patents. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%.
arXiv Detail & Related papers (2020-07-24T15:23:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.