Related papers: Can Large Language Models Empower Molecular Property Prediction?

Can Large Language Models Empower Molecular Property Prediction?

URL: http://arxiv.org/abs/2307.07443v1
Date: Fri, 14 Jul 2023 16:06:42 GMT
Title: Can Large Language Models Empower Molecular Property Prediction?
Authors: Chen Qian, Huayi Tang, Zhirui Yang, Hong Liang, Yong Liu
Abstract summary: Molecular property prediction has gained significant attention due to its transformative potential in scientific disciplines. Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP. In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules.
Score: 16.5246941211725
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Molecular property prediction has gained significant attention due to its transformative potential in multiple scientific disciplines. Conventionally, a molecule graph can be represented either as a graph-structured data or a SMILES text. Recently, the rapid development of Large Language Models (LLMs) has revolutionized the field of NLP. Although it is natural to utilize LLMs to assist in understanding molecules represented by SMILES, the exploration of how LLMs will impact molecular property prediction is still in its early stage. In this work, we advance towards this objective through two perspectives: zero/few-shot molecular classification, and using the new explanations generated by LLMs as representations of molecules. To be specific, we first prompt LLMs to do in-context molecular classification and evaluate their performance. After that, we employ LLMs to generate semantically enriched explanations for the original SMILES and then leverage that to fine-tune a small-scale LM model for multiple downstream tasks. The experimental results highlight the superiority of text explanations as molecular representations across multiple benchmark datasets, and confirm the immense potential of LLMs in molecular property prediction tasks. Codes are available at \url{https://github.com/ChnQ/LLM4Mol}.

Related papers

How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear.<n>We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures.<n>Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z)
$\ ext{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view.<n>Experiments demonstrate that $textM2$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
Improving Chemical Understanding of LLMs via SMILES Parsing [18.532188836688928]
CLEANMOL is a novel framework that formulates SMILES parsing into a suite of clean and deterministic tasks.<n>We construct a molecular pretraining dataset with adaptive difficulty scoring and pre-train open-source LLMs on these tasks.<n>Our results show that CLEANMOL not only enhances structural comprehension but also achieves the best or competes with the baseline on the Mol-Instructions benchmark.
arXiv Detail & Related papers (2025-05-22T07:54:39Z)
Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model [55.87790704067848]
Mol-LLaMA is a large molecular language model that grasps the general knowledge centered on molecules.<n>To improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders.
arXiv Detail & Related papers (2025-02-19T05:49:10Z)
Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization [8.846705148987652]
Large Language Models (LLMs) have motivated the development of general LLMs for molecular tasks. LLMs trained with naive next-token prediction training assign similar likelihood scores to both original and corrupted molecules. We introduce a novel multi-modal training method based on a thorough multi-modal instruction tuning and a molecular structure preference optimization.
arXiv Detail & Related papers (2025-02-05T01:14:12Z)
Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation [26.166926881479316]
We propose Speak-to-Structure (S2-Bench), the first benchmark to evaluate Large Language Models (LLMs) in open-domain natural language-driven molecule generation.<n>Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom)<n>We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama-3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S2-Bench.
arXiv Detail & Related papers (2024-12-19T08:51:16Z)
Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based underlineem Molecular underlineem Language underlineem Model, which randomly masking SMILES subsequences corresponding to specific molecular atoms. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction [44.27112553103388]
We present Molecule Caption Arena: the first comprehensive benchmark of large language models (LLMs)augmented molecular property prediction. We evaluate over twenty LLMs, including both general-purpose and domain-specific molecule captioners, across diverse prediction tasks. Our findings confirm the ability of LLM-extracted knowledge to enhance state-of-the-art molecular representations.
arXiv Detail & Related papers (2024-11-01T17:03:16Z)
FARM: Functional Group-Aware Representations for Small Molecules [55.281754551202326]
We introduce Functional Group-Aware Representations for Small Molecules (FARM) FARM is a foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. We rigorously evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 10 out of 12 tasks.
arXiv Detail & Related papers (2024-10-02T23:04:58Z)
MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension [34.586861881519134]
Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields. This study seeks to enhance the ability of LLMs to comprehend molecules by equipping them with a multi-modal external module, namely MolX. In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations.
arXiv Detail & Related papers (2024-06-10T20:25:18Z)
Benchmarking Large Language Models for Molecule Prediction Tasks [7.067145619709089]
Large Language Models (LLMs) stand at the forefront of a number of Natural Language Processing (NLP) tasks. This paper explores a fundamental question: Can LLMs effectively handle molecule prediction tasks? We identify several classification and regression prediction tasks across six standard molecule datasets. We compare their performance with existing Machine Learning (ML) models, which include text-based models and those specifically designed for analysing the geometric structure of molecules.
arXiv Detail & Related papers (2024-03-08T05:59:56Z)
Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks. The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human. These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z)
Can Large Language Models Understand Molecules? [0.0699049312989311]
We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks. We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks.
arXiv Detail & Related papers (2024-01-05T18:31:34Z)
Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks. In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation. We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z)
MolXPT: Wrapping Molecules with Text for Generative Pre-training [141.0924452870112]
MolXPT is a unified language model of text and molecules pre-trained on SMILES wrapped by text. MolXPT outperforms strong baselines of molecular property prediction on MoleculeNet.
arXiv Detail & Related papers (2023-05-18T03:58:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.