Integrating Chemistry Knowledge in Large Language Models via Prompt Engineering
- URL: http://arxiv.org/abs/2404.14467v1
- Date: Mon, 22 Apr 2024 16:55:44 GMT
- Title: Integrating Chemistry Knowledge in Large Language Models via Prompt Engineering
- Authors: Hongxuan Liu, Haoyu Yin, Zhiyao Luo, Xiaonan Wang,
- Abstract summary: This paper presents a study on the integration of domain-specific knowledge in prompt engineering to enhance the performance of large language models (LLMs) in scientific domains.
A benchmark dataset is curated to the intricate physical-chemical properties of small molecules, their drugability for pharmacology, alongside the functional attributes of enzymes and crystal materials.
The proposed domain-knowledge embedded prompt engineering method outperforms traditional prompt engineering strategies on various metrics.
- Score: 2.140221068402338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a study on the integration of domain-specific knowledge in prompt engineering to enhance the performance of large language models (LLMs) in scientific domains. A benchmark dataset is curated to encapsulate the intricate physical-chemical properties of small molecules, their drugability for pharmacology, alongside the functional attributes of enzymes and crystal materials, underscoring the relevance and applicability across biological and chemical domains.The proposed domain-knowledge embedded prompt engineering method outperforms traditional prompt engineering strategies on various metrics, including capability, accuracy, F1 score, and hallucination drop. The effectiveness of the method is demonstrated through case studies on complex materials including the MacMillan catalyst, paclitaxel, and lithium cobalt oxide. The results suggest that domain-knowledge prompts can guide LLMs to generate more accurate and relevant responses, highlighting the potential of LLMs as powerful tools for scientific discovery and innovation when equipped with domain-specific prompts. The study also discusses limitations and future directions for domain-specific prompt engineering development.
Related papers
- PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry [28.447446808292625]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering.
We introduce PharmaGPT, a suite of domain specilized LLMs specifically trained on the Bio-Pharmaceutical and Chemical domains.
Our evaluation shows that PharmaGPT surpasses existing general models on specific-domain benchmarks.
arXiv Detail & Related papers (2024-06-26T03:43:09Z) - CACTUS: Chemistry Agent Connecting Tool-Usage to Science [6.832077276041703]
Large language models (LLMs) have shown remarkable potential in various domains, but they often lack the ability to access and reason over domain-specific knowledge and tools.
We introduce CACTUS, an LLM-based agent that integrates cheminformatics tools to enable advanced reasoning and problem-solving in chemistry and molecular discovery.
We evaluate the performance of CACTUS using a diverse set of open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama2-7b, and Mistral-7b, on a benchmark of thousands of chemistry questions.
arXiv Detail & Related papers (2024-05-02T03:20:08Z) - DRAK: Unlocking Molecular Insights with Domain-Specific Retrieval-Augmented Knowledge in LLMs [6.728130796437259]
Domain-specific Retrieval-Augmented Knowledge (DRAK) is a non-parametric knowledge injection framework for large language models.
DRAK has developed profound expertise in the molecular domain and the capability to handle a broad spectrum of analysis tasks.
Our code will be available soon.
arXiv Detail & Related papers (2024-03-04T15:04:05Z) - An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z) - An Autonomous Large Language Model Agent for Chemical Literature Data
Mining [60.85177362167166]
We introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature.
Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data.
arXiv Detail & Related papers (2024-02-20T13:21:46Z) - Scientific Large Language Models: A Survey on Biological & Chemical Domains [47.97810890521825]
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension.
The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines.
As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration.
arXiv Detail & Related papers (2024-01-26T05:33:34Z) - Chemist-X: Large Language Model-empowered Agent for Reaction Condition Recommendation in Chemical Synthesis [57.70772230913099]
Chemist-X automates the reaction condition recommendation (RCR) task in chemical synthesis with retrieval-augmented generation (RAG) technology.
Chemist-X interrogates online molecular databases and distills critical data from the latest literature database.
Chemist-X considerably reduces chemists' workload and allows them to focus on more fundamental and creative problems.
arXiv Detail & Related papers (2023-11-16T01:21:33Z) - Knowledge-augmented Graph Machine Learning for Drug Discovery: A Survey [6.288056740658763]
Graph Machine Learning (GML) has gained considerable attention for its exceptional ability to model graph-structured biomedical data.
Recent studies have proposed integrating external biomedical knowledge into the GML pipeline to realise more precise and interpretable drug discovery.
arXiv Detail & Related papers (2023-02-16T12:38:01Z) - Improving Molecular Representation Learning with Metric
Learning-enhanced Optimal Transport [49.237577649802034]
We develop a novel optimal transport-based algorithm termed MROT to enhance their generalization capability for molecular regression problems.
MROT significantly outperforms state-of-the-art models, showing promising potential in accelerating the discovery of new substances.
arXiv Detail & Related papers (2022-02-13T04:56:18Z) - Machine Learning in Nano-Scale Biomedical Engineering [77.75587007080894]
We review the existing research regarding the use of machine learning in nano-scale biomedical engineering.
The main challenges that can be formulated as ML problems are classified into the three main categories.
For each of the presented methodologies, special emphasis is given to its principles, applications, and limitations.
arXiv Detail & Related papers (2020-08-05T15:45:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.