Multimodal machine learning with large language embedding model for polymer property prediction
- URL: http://arxiv.org/abs/2503.22962v1
- Date: Sat, 29 Mar 2025 03:48:11 GMT
- Title: Multimodal machine learning with large language embedding model for polymer property prediction
- Authors: Tianren Zhang, Dai-Bei Yang
- Abstract summary: We propose a simple yet effective multimodal architecture, PolyLLMem, for polymer property prediction tasks. PolyLLMem integrates text embeddings generated by Llama 3 with molecular structure embeddings derived from Uni-Mol. Its performance is comparable to, and in some cases exceeds, that of graph-based and transformer-based models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary large language models (LLMs), such as GPT-4 and Llama, have harnessed extensive computational power and diverse text corpora to achieve remarkable proficiency in interpreting and generating domain-specific content, including materials science. To leverage the domain knowledge embedded within these models, we propose a simple yet effective multimodal architecture, PolyLLMem, which integrates text embeddings generated by Llama 3 with molecular structure embeddings derived from Uni-Mol, for polymer property prediction tasks. In our model, low-rank adaptation (LoRA) layers were also incorporated during the property prediction tasks to refine the embeddings based on our limited polymer dataset, thereby enhancing their chemical relevance for polymer SMILES representation. This balanced fusion of fine-tuned textual and structural information enables PolyLLMem to accurately predict a variety of polymer properties despite the scarcity of training data. Its performance is comparable to, and in some cases exceeds, that of graph-based models, as well as transformer-based models that typically require pretraining on millions of polymer samples. These findings demonstrate that LLMs such as Llama can effectively capture chemical information encoded in polymer PSMILES, and underscore the efficacy of multimodal fusion of LLM embeddings and molecular structure embeddings in overcoming data scarcity and accelerating the discovery of advanced polymeric materials.
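The fusion described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embedding sizes, LoRA rank, scaling, and all variable names are hypothetical, and random vectors stand in for the actual Llama 3 and Uni-Mol embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes; random vectors stand in for the real
# Llama 3 text embedding and Uni-Mol structure embedding of one polymer.
d_text, d_struct, d_out, rank = 64, 32, 16, 4
text_emb = rng.standard_normal(d_text)
struct_emb = rng.standard_normal(d_struct)

def lora_linear(x, W, A, B, alpha=8.0):
    """Frozen projection W refined by a low-rank update (alpha/rank) * B @ A."""
    return x @ (W + (alpha / A.shape[0]) * (B @ A)).T

# One frozen projection per modality, each with trainable LoRA factors.
# Standard LoRA init: A is Gaussian, B is zero, so the update starts at zero.
W_t = rng.standard_normal((d_out, d_text)) * 0.1
A_t, B_t = rng.standard_normal((rank, d_text)), np.zeros((d_out, rank))
W_s = rng.standard_normal((d_out, d_struct)) * 0.1
A_s, B_s = rng.standard_normal((rank, d_struct)), np.zeros((d_out, rank))

# Balanced fusion: concatenate the two refined embeddings and feed a
# linear head that predicts a single polymer property.
fused = np.concatenate([lora_linear(text_emb, W_t, A_t, B_t),
                        lora_linear(struct_emb, W_s, A_s, B_s)])
w_head = rng.standard_normal(2 * d_out) * 0.1
prediction = float(fused @ w_head)
```

Because B is zero-initialized, the LoRA terms contribute nothing before training; fine-tuning on the small polymer dataset then only gradually perturbs the frozen projections, which is what keeps adaptation cheap in the low-data regime the abstract describes.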
Related papers
- POINT$^{2}$: A Polymer Informatics Training and Testing Database [15.45788515943579]
POINT$^{2}$ (POlymer INformatics Training and Testing) is a benchmark database and protocol designed to address critical challenges in polymer informatics. We develop an ensemble of ML models, including Quantile Random Forests, Multilayer Perceptrons with dropout, Graph Neural Networks, and pretrained large language models. These models are coupled with diverse polymer representations such as Morgan, MACCS, RDKit, Topological, and Atom Pair fingerprints, and graph-based descriptors.
arXiv Detail & Related papers (2025-03-30T15:46:01Z) - MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLMs) have shown strong capabilities in general visual reasoning tasks. We develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS), based on an MLLM. MAPS decomposes the expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z) - Molecular topological deep learning for polymer property prediction [18.602659324026934]
We develop molecular topological deep learning (Mol-TDL) for polymer property analysis.
Mol-TDL incorporates both high-order interactions and multiscale properties into topological deep learning architecture.
arXiv Detail & Related papers (2024-10-07T05:44:02Z) - Many-Shot In-Context Learning for Molecular Inverse Design [56.65345962071059]
Large Language Models (LLMs) have demonstrated great performance in few-shot In-Context Learning (ICL).
We develop a new semi-supervised learning method that overcomes the lack of experimental data available for many-shot ICL.
As we show, the new method greatly improves upon existing ICL methods for molecular design while being accessible and easy to use for scientists.
arXiv Detail & Related papers (2024-07-26T21:10:50Z) - MMPolymer: A Multimodal Multitask Pretraining Framework for Polymer Property Prediction [24.975491375575224]
MMPolymer is a novel multitask pretraining framework incorporating polymer 1D sequential and 3D structural information.
MMPolymer achieves state-of-the-art performance in downstream property prediction tasks.
arXiv Detail & Related papers (2024-06-07T08:19:59Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex, natural-language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Compositional Representation of Polymorphic Crystalline Materials [56.80318252233511]
We introduce PCRL, a novel approach that employs probabilistic modeling of composition to capture the diverse polymorphs from available structural information.
Extensive evaluations on sixteen datasets demonstrate the effectiveness of PCRL in learning compositional representation.
arXiv Detail & Related papers (2023-11-17T20:34:28Z) - Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z) - TransPolymer: a Transformer-based language model for polymer property
predictions [9.04563945965023]
TransPolymer is a Transformer-based language model for polymer property prediction.
Our proposed polymer tokenizer with chemical awareness enables learning representations from polymer sequences.
arXiv Detail & Related papers (2022-09-03T01:29:59Z) - BIGDML: Towards Exact Machine Learning Force Fields for Materials [55.944221055171276]
Machine-learning force fields (MLFF) should be accurate, computationally and data efficient, and applicable to molecules, materials, and interfaces thereof.
Here, we introduce the Bravais-Inspired Gradient-Domain Machine Learning approach and demonstrate its ability to construct reliable force fields using a training set with just 10-200 atoms.
arXiv Detail & Related papers (2021-06-08T10:14:57Z) - Copolymer Informatics with Multi-Task Deep Neural Networks [0.0]
We address the property prediction challenge for copolymers, extending the polymer informatics framework beyond homopolymers.
A large data set containing over 18,000 data points of glass transition, melting, and degradation temperature of homopolymers and copolymers of up to two monomers is used.
The developed models are accurate, fast, flexible, and scalable to more copolymer properties when suitable data become available.
arXiv Detail & Related papers (2021-03-25T23:28:20Z)
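The multi-task setup summarized in the copolymer entry above can be illustrated with a small sketch: a shared trunk produces one representation that separate heads map to the three temperatures. All sizes, names, and weights here are hypothetical placeholders, not the paper's network.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical copolymer feature vector (e.g. a fingerprint averaged over
# the two monomers, weighted by composition).
d_in, d_hidden = 24, 16
x = rng.standard_normal(d_in)

# Shared trunk: one representation reused by every property head.
W_shared = rng.standard_normal((d_hidden, d_in)) * 0.1
h = np.tanh(W_shared @ x)

# One linear head per target: glass transition (Tg), melting (Tm),
# and degradation (Td) temperature.
heads = {name: rng.standard_normal(d_hidden) * 0.1
         for name in ("Tg", "Tm", "Td")}
predictions = {name: float(w @ h) for name, w in heads.items()}
```

Supporting a further property is just another head on the same trunk, which is one reason such models scale to more copolymer properties as suitable data become available.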
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.