zERExtractor:An Automated Platform for Enzyme-Catalyzed Reaction Data Extraction from Scientific Literature
- URL: http://arxiv.org/abs/2508.09995v1
- Date: Wed, 30 Jul 2025 07:21:32 GMT
- Title: zERExtractor:An Automated Platform for Enzyme-Catalyzed Reaction Data Extraction from Scientific Literature
- Authors: Rui Zhou, Haohui Ma, Tianle Xin, Lixin Zou, Qiuyue Hu, Hongxi Cheng, Mingzhi Lin, Jingjing Guo, Sheng Wang, Guoqing Zhang, Yanjie Wei, Liangzhen Zheng,
- Abstract summary: zERExtractor is an automated platform for comprehensive extraction of enzyme-catalyzed reaction and activity data from scientific literature.<n>Our pipeline combines domain-adapted deep learning, advanced OCR, semantic entity recognition, and prompt-driven LLM modules.<n>We release a large benchmark dataset comprising over 1,000 annotated tables and 5,000 biological fields from 270 P450-related enzymology publications.
- Score: 12.109637682144125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid expansion of enzyme kinetics literature has outpaced the curation capabilities of major biochemical databases, creating a substantial barrier to AI-driven modeling and knowledge discovery. We present zERExtractor, an automated and extensible platform for comprehensive extraction of enzyme-catalyzed reaction and activity data from scientific literature. zERExtractor features a unified, modular architecture that supports plug-and-play integration of state-of-the-art models, including large language models (LLMs), as interchangeable components, enabling continuous system evolution alongside advances in AI. Our pipeline combines domain-adapted deep learning, advanced OCR, semantic entity recognition, and prompt-driven LLM modules, together with human expert corrections, to extract kinetic parameters (e.g., kcat, Km), enzyme sequences, substrate SMILES, experimental conditions, and molecular diagrams from heterogeneous document formats. Through active learning strategies integrating AI-assisted annotation, expert validation, and iterative refinement, the system adapts rapidly to new data sources. We also release a large benchmark dataset comprising over 1,000 annotated tables and 5,000 biological fields from 270 P450-related enzymology publications. Benchmarking demonstrates that zERExtractor consistently outperforms existing baselines in table recognition (Acc 89.9%), molecular image interpretation (up to 99.1%), and relation extraction (accuracy 94.2%). zERExtractor bridges the longstanding data gap in enzyme kinetics with a flexible, plugin-ready framework and high-fidelity extraction, laying the groundwork for future AI-powered enzyme modeling and biochemical knowledge discovery.
Related papers
- AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature [55.66036140125613]
This paper presents a large language model (LLM) agent named AgentCAT, which extracts and analyzes catalytic reaction data from chemical engineering papers.<n>AgentCAT serves as an alternative to overcome the long-standing data bottleneck in chemical engineering field.
arXiv Detail & Related papers (2026-02-10T04:30:11Z) - AGAPI-Agents: An Open-Access Agentic AI Platform for Accelerated Materials Design on AtomGPT.org [0.8093011368737527]
AGAPI (AtomGPT.org API) is an open-access agentic AI platform that integrates more than eight open-sources with over twenty materials-science API endpoints.<n>We demonstrate AGAPI through end-to-end construction, including heterostructure construction, powder X-ray diffraction analysis, and semiconductor defect engineering.<n>With more than 1,000 active users, AGAPI provides a scalable and transparent foundation for reproducible, AI-accelerated materials discovery.
arXiv Detail & Related papers (2025-12-12T06:28:28Z) - LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature [60.879220305044726]
We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data.<n>We curated 81k open-access papers, yielding LeMat- Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes.<n>We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
arXiv Detail & Related papers (2025-10-28T17:58:18Z) - HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG)<n>System addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z) - A Multi-Agent System Enables Versatile Information Extraction from the Chemical Literature [8.306442315850878]
We develop a multimodal large language model (MLLM)-based multi-agent system for robust and automated chemical information extraction.<n>Our system achieved an F1 score of 80.8% on a benchmark dataset of sophisticated multimodal chemical reaction graphics from the literature.
arXiv Detail & Related papers (2025-07-27T11:16:57Z) - ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.<n>This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.<n>Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z) - DrugPilot: LLM-based Parameterized Reasoning Agent for Drug Discovery [54.79763887844838]
Large language models (LLMs) integrated with autonomous agents hold significant potential for advancing scientific discovery through automated reasoning and task execution.<n>We introduce DrugPilot, a LLM-based agent system with a parameterized reasoning architecture designed for end-to-end scientific in drug discovery.<n>DrugPilot significantly outperforms state-of-the-art agents such as ReAct and LoT, achieving task completion rates of 98.0%, 93.5%, and 64.0% for simple, multi-tool, and multi-turn scenarios, respectively.
arXiv Detail & Related papers (2025-05-20T05:18:15Z) - EnzChemRED, a rich enzyme chemistry relation extraction dataset [3.6124226106001]
EnzChemRED consists of 1,210 expert curated PubMed abstracts in which enzymes and the chemical reactions they catalyze are annotated.
We show that fine-tuning pre-trained language models with EnzChemRED can significantly boost their ability to identify mentions of proteins and chemicals in text.
We combine the best performing methods after fine-tuning using EnzChemRED to create an end-to-end pipeline for knowledge extraction from text.
arXiv Detail & Related papers (2024-04-22T14:18:34Z) - Extracting Protein-Protein Interactions (PPIs) from Biomedical
Literature using Attention-based Relational Context Information [5.456047952635665]
This work presents a unified, multi-source PPI corpora with vetted interaction definitions augmented by binary interaction type labels.
A Transformer-based deep learning method exploits entities' relational context information for relation representation to improve relation classification performance.
The model's performance is evaluated on four widely studied biomedical relation extraction datasets.
arXiv Detail & Related papers (2024-03-08T01:43:21Z) - ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining [56.15126714863963]
ChemMiner is an end-to-end framework for extracting chemical data from literature.<n>ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation.<n> Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores.
arXiv Detail & Related papers (2024-02-20T13:21:46Z) - MolTrans: Molecular Interaction Transformer for Drug Target Interaction
Prediction [68.5766865583049]
Drug target interaction (DTI) prediction is a foundational task for in silico drug discovery.
Recent years have witnessed promising progress for deep learning in DTI predictions.
We propose a Molecular Interaction Transformer (TransMol) to address these limitations.
arXiv Detail & Related papers (2020-04-23T18:56:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.