Related papers: LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset

URL: http://arxiv.org/abs/2402.09391v3
Date: Mon, 1 Apr 2024 17:28:16 GMT
Title: LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset
Authors: Botao Yu, Frazier N. Baker, Ziqi Chen, Xia Ning, Huan Sun,
Abstract summary: We show that large language models (LLMs) can achieve very strong results on a comprehensive set of chemistry tasks. We propose SMolInstruct, a large-scale, comprehensive, and high-quality dataset for instruction tuning. Using SMolInstruct, we fine-tune a set of open-source LLMs, among which, we find that Mistral serves as the best base model for chemistry tasks.
Score: 13.063678216852473
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chemistry plays a crucial role in many domains, such as drug discovery and material science. While large language models (LLMs) such as GPT-4 exhibit remarkable capabilities on natural language processing tasks, existing research indicates that their performance on chemistry tasks is discouragingly low. In this paper, however, we demonstrate that our developed LLMs can achieve very strong results on a comprehensive set of chemistry tasks, outperforming the most advanced GPT-4 and Claude 3 Opus by a substantial margin. To accomplish this, we propose SMolInstruct, a large-scale, comprehensive, and high-quality dataset for instruction tuning. It contains 14 selected chemistry tasks and over three million samples, laying a solid foundation for training and evaluating LLMs for chemistry. Using SMolInstruct, we fine-tune a set of open-source LLMs, among which, we find that Mistral serves as the best base model for chemistry tasks. Our analysis further demonstrates the critical role of the proposed dataset in driving the performance improvements.

Related papers

CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning [12.745398618084474]
Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks.<n>We propose an LLM-based agent that integrates 137 external chemical tools created ranging from basic information retrieval to complex reaction predictions.
arXiv Detail & Related papers (2025-06-09T08:41:39Z)
ChemMLLM: Chemical Multimodal Large Language Model [52.95382215206681]
We propose ChemMLLM, a unified chemical multimodal large language model for molecule understanding and generation.<n>Also, we design five multimodal tasks across text, molecular SMILES strings, and image, and curate the datasets.<n> Experimental results show that ChemMLLM achieves superior performance across all evaluated tasks.
arXiv Detail & Related papers (2025-05-22T07:32:17Z)
ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning [64.2106664137118]
ChemAgent is a novel framework designed to improve the performance of large language models (LLMs) It is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. When presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory.
arXiv Detail & Related papers (2025-01-11T17:10:30Z)
From Generalist to Specialist: A Survey of Large Language Models for Chemistry [14.317448405387195]
Large Language Models (LLMs) have significantly transformed our daily life and established a new paradigm in natural language processing (NLP) The predominant pretraining of LLMs on extensive web-based texts remains insufficient for advanced scientific discovery, particularly in chemistry. Although several studies have reviewed Pretrained Language Models (PLMs) in chemistry, there is a conspicuous absence of a systematic survey specifically focused on chemistry-oriented LLMs.
arXiv Detail & Related papers (2024-12-28T03:40:25Z)
ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models [62.37850540570268]
Existing benchmarks in this domain fail to adequately meet the specific requirements of chemical research professionals. ChemEval identifies 4 crucial progressive levels in chemistry, assessing 12 dimensions of LLMs across 42 distinct chemical tasks. Results show that while general LLMs excel in literature understanding and instruction following, they fall short in tasks demanding advanced chemical knowledge.
arXiv Detail & Related papers (2024-09-21T02:50:43Z)
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area [50.15254966969718]
We introduce textbfChemVLM, an open-source chemical multimodal large language model for chemical applications. ChemVLM is trained on a carefully curated bilingual dataset that enhances its ability to understand both textual and visual chemical information. We benchmark ChemVLM against a range of open-source and proprietary multimodal large language models on various tasks.
arXiv Detail & Related papers (2024-08-14T01:16:40Z)
ChemLLM: A Chemical Large Language Model [49.308528569982805]
Large language models (LLMs) have made impressive progress in chemistry applications. However, the community lacks an LLM specifically designed for chemistry. Here, we introduce ChemLLM, a comprehensive framework that features the first LLM dedicated to chemistry.
arXiv Detail & Related papers (2024-02-10T01:11:59Z)
Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs) Our research aims to transform existing medication recommendation methodologies using LLMs. To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
ChemDFM: A Large Language Foundation Model for Chemistry [27.864255196445324]
A more generic and efficient solution would be an AI model that could address many tasks and support free-form dialogue in the broad field of chemistry. We develop ChemDFM, a pioneering LLM for chemistry trained on 34B tokens from chemical literature and textbooks, and fine-tuned using 2.7M instructions. We have open-sourced the inference codes, evaluation datasets, and model weights of ChemDFM on Huggingface.
arXiv Detail & Related papers (2024-01-26T12:45:55Z)
Structured Chemistry Reasoning with Large Language Models [70.13959639460015]
Large Language Models (LLMs) excel in diverse areas, yet struggle with complex scientific reasoning, especially in chemistry. We introduce StructChem, a simple yet effective prompting strategy that offers the desired guidance and substantially boosts the LLMs' chemical reasoning capability. Tests across four chemistry areas -- quantum chemistry, mechanics, physical chemistry, and kinetics -- StructChem substantially enhances GPT-4's performance, with up to 30% peak improvement.
arXiv Detail & Related papers (2023-11-16T08:20:36Z)
What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks [41.9830989458936]
Large Language Models (LLMs) with strong abilities in natural language processing tasks have emerged. We aim to evaluate capabilities of LLMs in a wide range of tasks across the chemistry domain.
arXiv Detail & Related papers (2023-05-27T14:17:33Z)
ChemCrow: Augmenting large-language models with chemistry tools [0.9195187117013247]
Large-language models (LLMs) have shown strong performance in tasks across domains, but struggle with chemistry-related problems. In this study, we introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery, and materials design. Our agent autonomously planned and executed the syntheses of an insect repellent, three organocatalysts, and guided the discovery of a novel chromophore.
arXiv Detail & Related papers (2023-04-11T17:41:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.