Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
- URL: http://arxiv.org/abs/2504.18880v2
- Date: Fri, 25 Jul 2025 10:08:19 GMT
- Title: Reshaping MOFs text mining with a dynamic multi-agents framework of large language model
- Authors: Zuhong Lin, Daoyuan Ren, Kai Ran, Jing Sun, Songlin Yu, Xuefeng Bai, Xiaotiang Huang, Haiyang He, Pengxu Pan, Xiaohang Zhang, Ying Fang, Tianying Wang, Minli Wu, Zhanglin Li, Xiaochuan Zhang, Haipu Li, Jingjing Yao,
- Abstract summary: We present MOFh6, a large language model (LLM)-based multi-agent system designed to extract, structure, and apply synthesis knowledge.<n>MoFh6 achieves 99% accuracy in synthesis data parsing and resolves 94.1% of complex co-reference abbreviations.<n>It processes a single full-text document in 9.6 seconds and localizes structured synthesis descriptions within 36 seconds, with the cost per 100 papers reduced to USD 4.24, a 76% saving over existing systems.
- Score: 5.150905688058796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurately identifying synthesis conditions for metal-organic frameworks (MOFs) remains a critical bottleneck in materials research, as translating literature-derived knowledge into actionable insights is hindered by the unstructured and heterogeneous nature of scientific texts. Here we present MOFh6, a large language model (LLM)-based multi-agent system designed to extract, structure, and apply synthesis knowledge from diverse input formats, including raw literature and crystal codes. Built on gpt-4o-mini and fine-tuned with up to few-shot expert-annotated data, MOFh6 achieves 99% accuracy in synthesis data parsing and resolves 94.1% of complex co-reference abbreviations. It processes a single full-text document in 9.6 seconds and localizes structured synthesis descriptions within 36 seconds, with the cost per 100 papers reduced to USD 4.24, a 76% saving over existing systems. By addressing long-standing limitations in cross-paragraph semantic fusion and terminology standardization, MOFh6 reshapes the LLM-based paradigm for MOF synthesis research, transforming static retrieval into an integrated and dynamic knowledge acquisition process. This shift bridges the gap between scientific literature and actionable synthesis design, providing a scalable framework for accelerating materials discovery.
Related papers
- Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation [4.875345207589195]
DocsRay is a training-free document understanding system.<n>It integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG)
arXiv Detail & Related papers (2025-07-31T03:14:45Z) - System of Agentic AI for the Discovery of Metal-Organic Frameworks [12.360146134865678]
Generative models and machine learning promise accelerated material discovery in MOFs for CO2 capture and water harvesting.<n>We present MOFGen, a system of Agentic AI comprising interconnected agents.<n>We generated hundreds of thousands of novel MOF structures and synthesizable organic linkers.
arXiv Detail & Related papers (2025-04-18T23:54:25Z) - Agentic Mixture-of-Workflows for Multi-Modal Chemical Search [0.0]
Large language models (LLMs) have demonstrated promising reasoning and automation capabilities across various domains.<n>We introduce CRAG-MoW - a novel paradigm that orchestrates multiple agentic employing distinct CRAG strategies.<n>We benchmark CRAG-MoWs across small molecules, polymers, and chemical reactions, as well as multi-modal nuclear magnetic resonance (NMR) spectral retrieval.
arXiv Detail & Related papers (2025-02-26T23:48:02Z) - RFL: Simplifying Chemical Structure Recognition with Ring-Free Language [66.47173094346115]
We propose a novel Ring-Free Language (RFL) to describe chemical structures in a hierarchical form.<n>RFL allows complex molecular structures to be decomposed into multiple parts, ensuring both uniqueness and conciseness.<n>We propose a universal Molecular Skeleton Decoder (MSD), which comprises a skeleton generation module that progressively predicts the molecular skeleton and individual rings.
arXiv Detail & Related papers (2024-12-10T15:29:32Z) - Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks [42.61784133509237]
Metal-organic frameworks (MOFs) are a class of crystalline materials with promising applications in many areas such as carbon capture and drug delivery.
Existing approaches, including ab initio calculations and even deep generative models, struggle with the complexity of MOF structures due to the large number of atoms in the unit cells.
We introduce MOFFlow, the first deep generative model tailored for MOF structure prediction.
arXiv Detail & Related papers (2024-10-07T13:51:58Z) - LLMs4Synthesis: Leveraging Large Language Models for Scientific Synthesis [0.16385815610837165]
This paper introduces the LLMs4Synthesis framework, designed to enhance the capabilities of Large Language Models (LLMs) in generating high-quality scientific syntheses.
It addresses the need for rapid, coherent, and contextually rich integration of scientific insights, leveraging both open-source and proprietary LLMs.
arXiv Detail & Related papers (2024-09-27T15:04:39Z) - Configurable Foundation Models: Building LLMs from a Modular Perspective [115.63847606634268]
A growing tendency to decompose LLMs into numerous functional modules allows for inference with part of modules and dynamic assembly of modules to tackle complex tasks.
We coin the term brick to represent each functional module, designating the modularized structure as customizable foundation models.
We present four brick-oriented operations: retrieval and routing, merging, updating, and growing.
We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions.
arXiv Detail & Related papers (2024-09-04T17:01:02Z) - Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval-enhancement can be extended to a broader spectrum of machine learning (ML)
This work introduces a formal framework of this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature in various domains in ML with consistent notations which is missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - ChemMiner: A Large Language Model Agent System for Chemical Literature Data Mining [56.15126714863963]
ChemMiner is an end-to-end framework for extracting chemical data from literature.<n>ChemMiner incorporates three specialized agents: a text analysis agent for coreference mapping, a multimodal agent for non-textual information extraction, and a synthesis analysis agent for data generation.<n> Experimental results demonstrate reaction identification rates comparable to human chemists while significantly reducing processing time, with high accuracy, recall, and F1 scores.
arXiv Detail & Related papers (2024-02-20T13:21:46Z) - Model Composition for Multimodal Large Language Models [71.5729418523411]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - AutoIE: An Automated Framework for Information Extraction from
Scientific Literature [6.235887933544583]
AutoIE is a framework designed to automate the extraction of vital data from scientific PDF documents.
Our SBERT model achieves high Marco F1 scores of 87.19 and 89.65 on CoNLL04 and ADE datasets.
This research paves the way for enhanced data management and interpretation in molecular sieve synthesis.
arXiv Detail & Related papers (2024-01-30T01:45:03Z) - ChEF: A Comprehensive Evaluation Framework for Standardized Assessment
of Multimodal Large Language Models [49.48109472893714]
Multimodal Large Language Models (MLLMs) have shown impressive abilities in interacting with visual content with myriad potential downstream tasks.
We present the first Comprehensive Evaluation Framework (ChEF) that can holistically profile each MLLM and fairly compare different MLLMs.
We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models.
arXiv Detail & Related papers (2023-11-05T16:01:40Z) - MOFDiff: Coarse-grained Diffusion for Metal-Organic Framework Design [4.819734936375677]
Metal-organic frameworks (MOFs) are of immense interest in applications such as gas storage and carbon capture.
We propose MOFDiff: a coarse-grained (CG) diffusion model that generates CG MOF structures.
We evaluate our model's capability to generate valid and novel MOF structures and its effectiveness in designing outstanding MOF materials.
arXiv Detail & Related papers (2023-10-16T18:00:15Z) - ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF
Synthesis [1.6889526065328493]
We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic frameworks (MOFs) synthesis conditions.
This effectively mitigates ChatGPT's tendency to hallucinate information.
arXiv Detail & Related papers (2023-06-20T05:20:29Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from
Literature with GPT-3 [52.59930033705221]
We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
arXiv Detail & Related papers (2023-04-26T22:21:33Z) - Building Open Knowledge Graph for Metal-Organic Frameworks (MOF-KG):
Challenges and Case Studies [63.61566811532431]
Metal-Organic Frameworks (MOFs) have great potential to revolutionize applications such as gas storage, molecular separations, chemical sensing, crystalline and drug delivery.
The Cambridge Structural Database (CSD) reports 10,636 synthesized MOF crystals which in addition contains ca. 114,373 MOF-like structures.
In this demo paper, we describe our effort on leveraging knowledge graph methods to facilitate MOF prediction, discovery, and synthesis.
arXiv Detail & Related papers (2022-07-10T16:41:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.