Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
- URL: http://arxiv.org/abs/2504.18880v3
- Date: Fri, 08 Aug 2025 08:35:50 GMT
- Title: Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
- Authors: Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu, Xuefeng Bai, Xiaotian Huang, Haiyang He, Pengxu Pan, Ying Fang, Zhanglin Li, Haipu Li, Jingjing Yao
- Abstract summary: We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables. MOFh6 achieved 99% extraction accuracy, resolved 94.1% of abbreviation cases across five major publishers, and maintained a precision of 0.93 +/- 0.01.
- Score: 4.285805877963645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurately identifying the synthesis conditions of metal-organic frameworks (MOFs) is essential for guiding experimental design, yet remains challenging because relevant information in the literature is often scattered, inconsistent, and difficult to interpret. We present MOFh6, a large language model driven system that reads raw articles or crystal codes and converts them into standardized synthesis tables. It links related descriptions across paragraphs, unifies ligand abbreviations with full names, and outputs structured parameters ready for use. MOFh6 achieved 99% extraction accuracy, resolved 94.1% of abbreviation cases across five major publishers, and maintained a precision of 0.93 +/- 0.01. Processing a full text takes 9.6 s, locating synthesis descriptions 36 s, with 100 papers processed for USD 4.24. By replacing static database lookups with real-time extraction, MOFh6 reshapes MOF synthesis research, accelerating the conversion of literature knowledge into practical synthesis protocols and enabling scalable, data-driven materials discovery.
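The cost and timing figures quoted in the abstract can be scaled to larger corpora with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes that cost and time grow linearly with the number of papers and that papers are processed one after another, neither of which the abstract states. The function name `estimate_batch` is hypothetical and is not part of MOFh6.

```python
# Figures quoted in the abstract above.
COST_PER_100_PAPERS_USD = 4.24
SECONDS_PER_FULL_TEXT = 9.6


def estimate_batch(n_papers: int) -> tuple[float, float]:
    """Return (cost in USD, sequential processing time in hours) for n_papers.

    Assumption (not stated in the abstract): both cost and time scale
    linearly with the number of papers, with no parallelism.
    """
    cost_usd = COST_PER_100_PAPERS_USD / 100 * n_papers
    hours = SECONDS_PER_FULL_TEXT * n_papers / 3600
    return cost_usd, hours


cost, hours = estimate_batch(10_000)
print(f"10,000 papers: about USD {cost:.2f} and {hours:.1f} h if run sequentially")
```

Under these assumptions the per-paper cost is USD 0.0424; real costs would depend on article length and on the pricing of the underlying model.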
Related papers
- LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature [60.879220305044726]
We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes. We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
arXiv Detail & Related papers (2025-10-28T17:58:18Z) - ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature [0.2447206672789868]
ComProScanner is an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of chemical compositions and properties. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82.
arXiv Detail & Related papers (2025-10-23T09:01:44Z) - HySemRAG: A Hybrid Semantic Retrieval-Augmented Generation Framework for Automated Literature Synthesis and Methodological Gap Analysis [55.2480439325792]
HySemRAG is a framework that combines Extract, Transform, Load (ETL) pipelines with Retrieval-Augmented Generation (RAG). The system addresses limitations in existing RAG architectures through a multi-layered approach.
arXiv Detail & Related papers (2025-08-01T20:30:42Z) - Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation [4.875345207589195]
DocsRay is a training-free document understanding system. It integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG).
arXiv Detail & Related papers (2025-07-31T03:14:45Z) - ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences. This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input. Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z) - System of Agentic AI for the Discovery of Metal-Organic Frameworks [12.360146134865678]
Generative models and machine learning promise accelerated material discovery in MOFs for CO2 capture and water harvesting. We present MOFGen, a system of Agentic AI comprising interconnected agents. We generated hundreds of thousands of novel MOF structures and synthesizable organic linkers.
arXiv Detail & Related papers (2025-04-18T23:54:25Z) - Agentic Mixture-of-Workflows for Multi-Modal Chemical Search [0.0]
Large language models (LLMs) have demonstrated promising reasoning and automation capabilities across various domains. We introduce CRAG-MoW, a novel paradigm that orchestrates multiple agentic workflows employing distinct CRAG strategies. We benchmark CRAG-MoWs across small molecules, polymers, and chemical reactions, as well as multi-modal nuclear magnetic resonance (NMR) spectral retrieval.
arXiv Detail & Related papers (2025-02-26T23:48:02Z) - RFL: Simplifying Chemical Structure Recognition with Ring-Free Language [66.47173094346115]
We propose a novel Ring-Free Language (RFL) to describe chemical structures in a hierarchical form. RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness. We propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings.
arXiv Detail & Related papers (2024-12-10T15:29:32Z) - Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks [42.61784133509237]
Metal-organic frameworks (MOFs) are a class of crystalline materials with promising applications in many areas such as carbon capture and drug delivery.
Existing approaches, including ab initio calculations and even deep generative models, struggle with the complexity of MOF structures due to the large number of atoms in the unit cells.
We introduce MOFFlow, the first deep generative model tailored for MOF structure prediction.
arXiv Detail & Related papers (2024-10-07T13:51:58Z) - LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis [0.16385815610837165]
This paper introduces the LLMs4Synthesis framework, designed to enhance the capabilities of Large Language Models (LLMs) in generating high-quality scientific syntheses.
It addresses the need for rapid, coherent, and contextually rich integration of scientific insights, leveraging both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-09-27T15:04:39Z) - Configurable Foundation Models: Building LLMs from a Modular Perspective [115.63847606634268]
A growing tendency to decompose LLMs into numerous functional modules allows for inference with part of modules and dynamic assembly of modules to tackle complex tasks.
We coin the term brick to represent each functional module, designating the modularized structure as customizable foundation models.
We present four brick-oriented operations: retrieval and routing, merging, updating, and growing.
We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions.
arXiv Detail & Related papers (2024-09-04T17:01:02Z) - Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval-enhancement can be extended to a broader spectrum of machine learning (ML).
This work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notations which is missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining [56.15126714863963]
ChemMiner is an end-to-end framework for extracting chemical data from literature. ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation. Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores.
arXiv Detail & Related papers (2024-02-20T13:21:46Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - AutoIE: An Automated Framework for Information Extraction from Scientific Literature [6.235887933544583]
AutoIE is a framework designed to automate the extraction of vital data from scientific PDF documents.
Our SBERT model achieves high Macro F1 scores of 87.19 and 89.65 on the CoNLL04 and ADE datasets.
This research paves the way for enhanced data management and interpretation in molecular sieve synthesis.
arXiv Detail & Related papers (2024-01-30T01:45:03Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z) - MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design [4.819734936375677]
Metal-organic frameworks (MOFs) are of immense interest in applications such as gas storage and carbon capture.
We propose MOFDiff: a coarse-grained (CG) diffusion model that generates CG MOF structures.
We evaluate our model's capability to generate valid and novel MOF structures and its effectiveness in designing outstanding MOF materials.
arXiv Detail & Related papers (2023-10-16T18:00:15Z) - ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis [1.6889526065328493]
We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic frameworks (MOFs) synthesis conditions.
This effectively mitigates ChatGPT's tendency to hallucinate information.
arXiv Detail & Related papers (2023-06-20T05:20:29Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3 [52.59930033705221]
We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
arXiv Detail & Related papers (2023-04-26T22:21:33Z) - Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z) - Building Open Knowledge Graph for Metal-Organic Frameworks (MOF-KG): Challenges and Case Studies [63.61566811532431]
Metal-Organic Frameworks (MOFs) have great potential to revolutionize applications such as gas storage, molecular separations, chemical sensing, and drug delivery.
The Cambridge Structural Database (CSD) reports 10,636 synthesized MOF crystals and additionally contains ca. 114,373 MOF-like structures.
In this demo paper, we describe our effort on leveraging knowledge graph methods to facilitate MOF prediction, discovery, and synthesis.
arXiv Detail & Related papers (2022-07-10T16:41:11Z) - Annotating and Extracting Synthesis Process of All-Solid-State Batteries from Scientific Literature [10.443499579567069]
We present a novel corpus of the synthesis process for all-solid-state batteries and an automated machine reading system.
We define the representation of the synthesis processes using flow graphs, and create a corpus from the experimental sections of 243 papers.
The automated machine-reading system is developed by a deep learning-based sequence tagger and simple rule-based relation extractor.
arXiv Detail & Related papers (2020-02-18T02:30:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.