Scientific Language Modeling: A Quantitative Review of Large Language
Models in Molecular Science
- URL: http://arxiv.org/abs/2402.04119v1
- Date: Tue, 6 Feb 2024 16:12:36 GMT
- Title: Scientific Language Modeling: A Quantitative Review of Large Language
Models in Molecular Science
- Authors: Pengfei Liu, Jun Tao, Zhixiang Ren
- Abstract summary: Large language models (LLMs) offer a fresh approach to tackle scientific problems from a natural language processing (NLP) perspective.
We propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263 experiments to assess the model's compatibility with data modalities and knowledge acquisition.
Our pioneering analysis offers an exploration of the learning mechanism and paves the way for advancing SLM in molecular science.
- Score: 27.874571056109758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficient molecular modeling and design are crucial for the discovery and
exploration of novel molecules, and the incorporation of deep learning methods
has revolutionized this field. In particular, large language models (LLMs)
offer a fresh approach to tackle scientific problems from a natural language
processing (NLP) perspective, introducing a research paradigm called scientific
language modeling (SLM). However, two key issues remain: how to quantify the
match between model and data modalities and how to identify the
knowledge-learning preferences of models. To address these challenges, we
propose a multi-modal benchmark, named ChEBI-20-MM, and perform 1263
experiments to assess the model's compatibility with data modalities and
knowledge acquisition. Through the modal transition probability matrix, we
provide insights into the most suitable modalities for tasks. Furthermore, we
introduce a statistically interpretable approach to discover context-specific
knowledge mapping by localized feature filtering. Our pioneering analysis
offers an exploration of the learning mechanism and paves the way for advancing
SLM in molecular science.
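The modal transition probability matrix mentioned in the abstract suggests a simple tabulation: for each benchmark task, record which (input modality, output modality) pair achieves the best score, then normalize the win counts into probabilities. A minimal sketch under that reading; the schema, task names, and scores below are illustrative assumptions, not the paper's actual data:

```python
from collections import Counter

# Hypothetical benchmark records: one row per experiment, with the input/output
# modalities used and the score achieved on a task. Schema and values are
# illustrative, not the paper's actual data.
results = [
    {"task": "molecule captioning",   "src": "SMILES",  "tgt": "text",    "score": 0.58},
    {"task": "molecule captioning",   "src": "SELFIES", "tgt": "text",    "score": 0.61},
    {"task": "text-based generation", "src": "text",    "tgt": "SMILES",  "score": 0.49},
    {"task": "text-based generation", "src": "text",    "tgt": "SELFIES", "score": 0.44},
]

def transition_matrix(results):
    """For each (source, target) modality pair, count how often that pair
    achieves the best score on a task, then normalize to probabilities."""
    best = {}
    for r in results:
        if r["task"] not in best or r["score"] > best[r["task"]]["score"]:
            best[r["task"]] = r
    wins = Counter((r["src"], r["tgt"]) for r in best.values())
    total = sum(wins.values())
    return {pair: n / total for pair, n in wins.items()}

print(transition_matrix(results))
# {('SELFIES', 'text'): 0.5, ('text', 'SMILES'): 0.5}
```

Reading off the highest-probability pairs then yields the "most suitable modalities for tasks" that the abstract refers to.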
Related papers
- Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model [55.87790704067848]
Mol-LLaMA is a large molecular language model that grasps the general knowledge centered on molecules via multi-modal instruction tuning.
To improve understanding of molecular features, we introduce a module that integrates complementary information from different molecular encoders.
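As a concrete picture of what such an integration module could look like, here is a minimal PyTorch sketch that fuses token embeddings from two molecular encoders (say, a 2D-graph encoder and a 3D-geometry encoder) via cross-attention. The architecture and names are illustrative assumptions, not Mol-LLaMA's actual module:

```python
import torch
import torch.nn as nn

class EncoderFusion(nn.Module):
    """Toy fusion module: merge embeddings from two molecular encoders
    (e.g., a 2D-graph encoder and a 3D-geometry encoder) with cross-attention.
    A generic sketch of the idea, not Mol-LLaMA's actual module."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, h_2d: torch.Tensor, h_3d: torch.Tensor) -> torch.Tensor:
        # Let 2D token embeddings attend to the complementary 3D embeddings...
        attended, _ = self.attn(query=h_2d, key=h_3d, value=h_3d)
        # ...then combine the original and attended views.
        return self.proj(torch.cat([h_2d, attended], dim=-1))

fusion = EncoderFusion()
h_2d = torch.randn(8, 32, 256)   # batch of 8 molecules, 32 node tokens each
h_3d = torch.randn(8, 32, 256)
print(fusion(h_2d, h_3d).shape)  # torch.Size([8, 32, 256])
```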
arXiv Detail & Related papers (2025-02-19T05:49:10Z)
- MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction [44.27112553103388]
We present Molecule Caption Arena: the first comprehensive benchmark of large language model (LLM)-augmented molecular property prediction.
We evaluate over twenty LLMs, including both general-purpose and domain-specific molecule captioners, across diverse prediction tasks.
Our findings confirm the ability of LLM-extracted knowledge to enhance state-of-the-art molecular representations.
arXiv Detail & Related papers (2024-11-01T17:03:16Z)
- Explainable AI Methods for Multi-Omics Analysis: A Survey [3.885941688264509]
Multi-omics refers to the integrative analysis of data derived from multiple 'omes'.
Deep learning methods are increasingly utilized to integrate multi-omics data, offering insights into molecular interactions and enhancing research into complex diseases.
These models, with their numerous interconnected layers and nonlinear relationships, often function as black boxes, lacking transparency in decision-making processes.
This review explores how xAI can improve the interpretability of deep learning models in multi-omics research, highlighting its potential to provide clinicians with clear insights.
arXiv Detail & Related papers (2024-10-15T05:01:17Z)
- Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models [12.744381867301353]
We propose a novel Molecular Graph representation learning framework that integrates Large language models and Domain-specific small models.
We employ a multi-modal alignment method to coordinate various modalities, including molecular graphs and their corresponding descriptive texts, to guide the pre-training of molecular representations.
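A common way to realize such multi-modal alignment is a CLIP-style contrastive objective that pulls each graph embedding toward the embedding of its paired description. A minimal sketch of that generic recipe, not necessarily the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(graph_emb, text_emb, temperature=0.07):
    """CLIP-style InfoNCE loss aligning molecular-graph embeddings with
    embeddings of their paired descriptions. A generic illustration of
    multi-modal alignment, not the paper's exact objective."""
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature          # pairwise similarities
    targets = torch.arange(len(g))          # i-th graph matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```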
arXiv Detail & Related papers (2024-08-19T16:11:59Z)
- Many-Shot In-Context Learning for Molecular Inverse Design [56.65345962071059]
Large Language Models (LLMs) have demonstrated great performance in few-shot In-Context Learning (ICL).
We develop a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL.
As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
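In spirit, many-shot ICL for inverse design amounts to packing a long list of property-to-molecule demonstrations into the prompt and asking for a new molecule at a target property value. A toy sketch of that idea; the prompt format and logP values are illustrative, and the paper's semi-supervised example selection is not reproduced here:

```python
def many_shot_prompt(examples, target_logp):
    """Assemble a many-shot in-context prompt for inverse design: show the
    model many property->molecule pairs, then ask for a new molecule with
    the target property value."""
    shots = "\n".join(f"logP: {p:.2f} -> SMILES: {s}" for s, p in examples)
    return (
        "Propose molecules with the requested logP.\n"
        f"{shots}\n"
        f"logP: {target_logp:.2f} -> SMILES:"
    )

# Toy demonstrations with approximate logP values; a real many-shot prompt
# would contain hundreds of such pairs.
examples = [("CCO", -0.31), ("c1ccccc1", 2.13), ("CCCCCC", 3.76)]
print(many_shot_prompt(examples, target_logp=2.50))
```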
arXiv Detail & Related papers (2024-07-26T21:10:50Z)
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce TSMMG, a large language model for multi-constraint molecular generation that acts as the 'student' in a teacher-student framework.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from a collection of 'teacher' models and tools.
We experimentally show that TSMMG performs remarkably well at generating molecules that meet complex, natural-language-described property requirements.
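One way to picture the teacher-student data construction: treat cheminformatics tools as 'teachers' that annotate molecules with properties, then template those annotations into text, yielding text-molecule training pairs. A toy stand-in using RDKit; the real pipeline's teachers and templates may differ:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def describe(smiles):
    """Use cheminformatics 'teachers' (here, RDKit property calculators) to
    turn a molecule into a natural-language property description, yielding a
    text-molecule training pair. A toy stand-in for a teacher-student data
    pipeline, not TSMMG's actual one."""
    mol = Chem.MolFromSmiles(smiles)
    text = (f"a molecule with molecular weight {Descriptors.MolWt(mol):.1f}, "
            f"logP {Descriptors.MolLogP(mol):.2f}, "
            f"and QED {QED.qed(mol):.2f}")
    return text, smiles

print(describe("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```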
arXiv Detail & Related papers (2024-03-20T02:15:55Z)
- MolTC: Towards Molecular Relational Modeling In Language Models [28.960416816491392]
We propose MolTC, a novel framework for Molecular inTeraction prediction that follows Chain-of-Thought (CoT) reasoning.
Our experiments, conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines.
arXiv Detail & Related papers (2024-02-06T07:51:56Z)
- Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
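Retrieval-based few-shot prompting of this kind typically selects in-context examples by structural similarity to the query molecule. A minimal sketch using RDKit Morgan fingerprints and Tanimoto similarity; MolReGPT's actual retrieval strategy and prompt template may differ:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def retrieve_examples(query_smiles, corpus, k=2):
    """Pick the k corpus molecules most similar to the query (Tanimoto over
    Morgan fingerprints) to serve as in-context examples for captioning.
    A minimal sketch of retrieval-based few-shot prompting."""
    def fp(smi):
        return AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smi), radius=2, nBits=2048)
    q = fp(query_smiles)
    scored = [(DataStructs.TanimotoSimilarity(q, fp(s)), s, cap)
              for s, cap in corpus]
    return [(s, cap) for _, s, cap in sorted(scored, reverse=True)[:k]]

corpus = [("CCO", "Ethanol is a simple primary alcohol."),
          ("CC(=O)O", "Acetic acid is a short-chain carboxylic acid."),
          ("c1ccccc1O", "Phenol is an aromatic alcohol.")]
print(retrieve_examples("CCCO", corpus))
```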
arXiv Detail & Related papers (2023-06-11T08:16:25Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Knowledge-informed Molecular Learning: A Survey on Paradigm Transfer [20.893861195128643]
Machine learning, notably deep learning, has significantly propelled molecular investigations within the biochemical sphere.
Traditionally, modeling for such research has centered around a handful of paradigms.
To enhance the generation and decipherability of purely data-driven models, scholars have integrated biochemical domain knowledge into these molecular study models.
arXiv Detail & Related papers (2022-02-17T06:18:02Z)
- Learning Neural Generative Dynamics for Molecular Conformation Generation [89.03173504444415]
We study how to generate molecule conformations (i.e., 3D structures) from a molecular graph.
We propose a novel probabilistic framework to generate valid and diverse conformations given a molecular graph.
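For context on the task itself, a classical non-neural baseline generates conformers by distance geometry; the RDKit snippet below does this with ETKDG. This is a standard baseline for the same task, not the paper's learned probabilistic model:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Generate diverse 3D conformers for a molecular graph with RDKit's ETKDG
# distance-geometry method -- a classical baseline for conformation
# generation, not the paper's neural approach.
mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
params = AllChem.ETKDGv3()
params.randomSeed = 42
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, params=params)
AllChem.MMFFOptimizeMoleculeConfs(mol)  # relax each conformer with MMFF
print(f"generated {len(conf_ids)} conformers")
```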
arXiv Detail & Related papers (2021-02-20T03:17:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences arising from its use.