Related papers: ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis

ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis

URL: http://arxiv.org/abs/2306.11296v2
Date: Thu, 20 Jul 2023 02:20:35 GMT
Title: ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis
Authors: Zhiling Zheng, Oufan Zhang, Christian Borgs, Jennifer T. Chayes, Omar M. Yaghi
Abstract summary: We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic frameworks (MOFs) synthesis conditions. This effectively mitigates ChatGPT's tendency to hallucinate information.
Score: 1.6889526065328493
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We use prompt engineering to guide ChatGPT in the automation of text mining of metal-organic frameworks (MOFs) synthesis conditions from diverse formats and styles of the scientific literature. This effectively mitigates ChatGPT's tendency to hallucinate information -- an issue that previously made the use of Large Language Models (LLMs) in scientific fields challenging. Our approach involves the development of a workflow implementing three different processes for text mining, programmed by ChatGPT itself. All of them enable parsing, searching, filtering, classification, summarization, and data unification with different tradeoffs between labor, speed, and accuracy. We deploy this system to extract 26,257 distinct synthesis parameters pertaining to approximately 800 MOFs sourced from peer-reviewed research articles. This process incorporates our ChemPrompt Engineering strategy to instruct ChatGPT in text mining, resulting in impressive precision, recall, and F1 scores of 90-99%. Furthermore, with the dataset built by text mining, we constructed a machine-learning model with over 86% accuracy in predicting MOF experimental crystallization outcomes and preliminarily identifying important factors in MOF crystallization. We also developed a reliable data-grounded MOF chatbot to answer questions on chemical reactions and synthesis procedures. Given that the process of using ChatGPT reliably mines and tabulates diverse MOF synthesis information in a unified format, while using only narrative language requiring no coding expertise, we anticipate that our ChatGPT Chemistry Assistant will be very useful across various other chemistry sub-disciplines.

Related papers

ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.<n>This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.<n>Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
Reshaping MOFs text mining with a dynamic multi-agents framework of large language model [5.150905688058796]
We present MOFh6, a large language model (LLM)-based multi-agent system designed to extract, structure, and apply synthesis knowledge.<n>MoFh6 achieves 99% accuracy in synthesis data parsing and resolves 94.1% of complex co-reference abbreviations.<n>It processes a single full-text document in 9.6 seconds and localizes structured synthesis descriptions within 36 seconds, with the cost per 100 papers reduced to USD 4.24, a 76% saving over existing systems.
arXiv Detail & Related papers (2025-04-26T09:55:04Z)
Text to Band Gap: Pre-trained Language Models as Encoders for Semiconductor Band Gap Prediction [5.812284760539713]
We investigate the use of transformer-based language models, RoBERTa, T5, and LLaMA, for predicting the band gaps of semiconductor materials.<n>We construct material descriptions in two formats: structured strings that combine key features in a consistent template, and natural language narratives generated using the ChatGPT API.<n>Our results show that finetuned language models, particularly the decoder-only LLaMA-3 architecture, can outperform conventional approaches in prediction accuracy and flexibility.
arXiv Detail & Related papers (2025-01-07T00:56:26Z)
BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction [65.93303145891628]
BatGPT-Chem is a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction. Our model captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions. This development empowers chemists to adeptly address novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science.
arXiv Detail & Related papers (2024-08-19T05:17:40Z)
Single and Multi-Hop Question-Answering Datasets for Reticular Chemistry with GPT-4-Turbo [0.5110571587151475]
'RetChemQA' is a benchmark dataset designed to evaluate the capabilities of machine learning models in the domain of reticular chemistry. This dataset includes both single-hop and multi-hop question-answer pairs, encompassing approximately 45,000 Q&As for each type. The questions have been extracted from an extensive corpus of literature containing about 2,530 research papers from publishers including NAS, ACS, RSC, Elsevier, and Nature Publishing Group.
arXiv Detail & Related papers (2024-05-03T14:29:54Z)
An Autonomous Large Language Model Agent for Chemical Literature Data Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z)
Language Models as Science Tutors [79.73256703631492]
We introduce TutorEval and TutorChat to measure real-life usability of LMs as scientific assistants. We show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH.
arXiv Detail & Related papers (2024-02-16T22:24:13Z)
Image and Data Mining in Reticular Chemistry Using GPT-4V [5.440238820637818]
GPT-4V is a large language model featuring enhanced vision capabilities, accessible through ChatGPT or an API. This study demonstrates the remarkable ability of GPT-4V to navigate and obtain complex data for metal-organic frameworks.
arXiv Detail & Related papers (2023-12-09T05:05:25Z)
GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction [6.349503549199403]
We present GPT-MolBERTa, a self-supervised large language model (LLM) which uses detailed textual descriptions of molecules to predict their properties. A text based description of 326000 molecules were collected using ChatGPT and used to train LLM to learn the representation of molecules. Experiments show that GPT-MolBERTa performs well on various molecule property benchmarks, and approaching state of the art performance in regression tasks.
arXiv Detail & Related papers (2023-09-20T17:21:43Z)
Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining. We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data. Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks. We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset. ChatGPT is 63.41% accurate on average in 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)
Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction. The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts. This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
Named entity recognition in chemical patents using ensemble of contextual language models [0.3731111830152912]
We study the effectiveness of contextualized language models to extract information from chemical patents. Our best model, based on a majority ensemble approach, achieves an exact F1-score of 92.30% and a relaxed F1-score of 96.24%.
arXiv Detail & Related papers (2020-07-24T15:23:45Z)
Annotating and Extracting Synthesis Process of All-Solid-State Batteries from Scientific Literature [10.443499579567069]
We present a novel corpus of the synthesis process for all-solid-state batteries and an automated machine reading system. We define the representation of the synthesis processes using flow graphs, and create a corpus from the experimental sections of 243 papers. The automated machine-reading system is developed by a deep learning-based sequence tagger and simple rule-based relation extractor.
arXiv Detail & Related papers (2020-02-18T02:30:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.