NaFM: Pre-training a Foundation Model for Small-Molecule Natural Products
- URL: http://arxiv.org/abs/2503.17656v1
- Date: Sat, 22 Mar 2025 05:32:03 GMT
- Title: NaFM: Pre-training a Foundation Model for Small-Molecule Natural Products
- Authors: Yuheng Ding, Yusong Wang, Bo Qiang, Jie Yu, Qi Li, Yiran Zhou, Zhenmin Liu,
- Abstract summary: Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities. Existing deep learning methods for natural products research rely on supervised learning approaches designed for specific downstream tasks. We have pre-trained a foundation model for natural products based on their unique properties. Our framework achieves state-of-the-art (SOTA) results in various downstream tasks related to natural product mining and drug discovery.
- Score: 7.124182654659631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural products, as metabolites from microorganisms, animals, or plants, exhibit diverse biological activities, making them crucial for drug discovery. Existing deep learning methods for natural products research primarily rely on supervised learning approaches designed for specific downstream tasks. However, such a one-model-per-task paradigm often lacks generalizability and leaves significant room for performance improvement. Additionally, existing molecular characterization methods are not well suited to the unique tasks associated with natural products. To address these limitations, we have pre-trained a foundation model for natural products based on their unique properties. Our approach employs a novel pretraining strategy tailored specifically to natural products. By combining contrastive learning and masked graph learning objectives, we emphasize evolutionary information from molecular scaffolds while capturing side-chain information. Our framework achieves state-of-the-art (SOTA) results on various downstream tasks related to natural product mining and drug discovery. We first compare taxonomy classification against baselines designed for synthesized molecules to show that current models are inadequate for understanding natural synthesis. Furthermore, through fine-grained analysis at both the gene and microbial levels, NaFM demonstrates the ability to capture evolutionary information. Finally, we evaluate our method on virtual screening, showing that its informative natural product representations lead to more effective identification of potential drug candidates.
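To make the pretraining objective concrete, below is a minimal sketch of what a scaffold-aware contrastive loss combined with masked-node reconstruction might look like. The encoder, batch construction, and loss weighting are illustrative assumptions, not NaFM's actual implementation; only the use of scaffolds to define positives and a masked graph learning term follow the abstract.

```python
import torch
import torch.nn.functional as F
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_smiles(smiles: str) -> str:
    """Bemis-Murcko scaffold, used here to define contrastive positives."""
    return MurckoScaffold.MurckoScaffoldSmiles(mol=Chem.MolFromSmiles(smiles))

def scaffold_contrastive_loss(z: torch.Tensor, scaffolds: list[str],
                              tau: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch in which molecules sharing a scaffold are positives."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                          # (B, B) similarity logits
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = torch.tensor([[a == b for b in scaffolds] for a in scaffolds]) & ~eye
    logits = sim.masked_fill(eye, float("-inf"))   # never contrast with self
    log_prob = logits - logits.logsumexp(dim=-1, keepdim=True)
    log_prob = log_prob.masked_fill(~pos, 0.0)     # keep positive-pair terms only
    has_pos = pos.any(dim=-1)                      # skip anchors with no positive
    return -(log_prob[has_pos].sum(-1) / pos[has_pos].sum(-1)).mean()

def masked_node_loss(node_logits: torch.Tensor, node_labels: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on masked atoms only: the side-chain reconstruction signal."""
    return F.cross_entropy(node_logits[mask], node_labels[mask])

# Combined objective (lam is an assumed weighting hyperparameter):
# loss = scaffold_contrastive_loss(z, [scaffold_smiles(s) for s in batch_smiles]) \
#        + lam * masked_node_loss(node_logits, node_labels, mask)
```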
Related papers
- Nature Language Model: Deciphering the Language of Nature for Scientific Discovery [105.55751854768297]
Foundation models have revolutionized natural language processing and artificial intelligence. We introduce Nature Language Model (NatureLM), a sequence-based science foundation model for scientific discovery.
arXiv Detail & Related papers (2025-02-11T13:08:03Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
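A minimal sketch of the atom-token masking idea in the entry above, assuming a conventional regex SMILES tokenizer and a 15% mask rate (both assumptions; the paper's tokenizer and rate may differ):

```python
import random
import re

# A common (assumed) regex tokenizer for SMILES strings.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|@@|[BCNOSPFIbcnops]|[=#\\/()+\-@0-9%]")

def mask_smiles(smiles: str, rate: float = 0.15, mask_token: str = "[MASK]"):
    """Randomly mask atom tokens; non-atom syntax (bonds, rings) is left intact."""
    masked, labels = [], []
    for tok in SMILES_TOKEN.findall(smiles):
        is_atom = tok.startswith("[") or tok.isalpha()
        if is_atom and random.random() < rate:
            masked.append(mask_token)
            labels.append(tok)        # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)       # not a prediction target
    return masked, labels

# Example: mask_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```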
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross-modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z)
- Integrating Chemical Language and Molecular Graph in Multimodal Fused Deep Learning for Drug Property Prediction [9.388979080270103]
We construct multimodal deep learning models to cover different molecular representations.
Compared with mono-modal models, our multimodal fused deep learning (MMFDL) models achieve higher accuracy, reliability, and robustness to noise.
arXiv Detail & Related papers (2023-12-29T07:19:42Z)
- Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose conversational molecular design, a novel task that uses natural language to describe and edit target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z)
- Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model [44.60174246341653]
We present a novel multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties.
Our proposed model pipeline of data handling and training objectives aligns the structure/property features in a common embedding space.
These contributions yield synergistic knowledge, allowing us to tackle both multimodal and unimodal downstream tasks with a single model.
arXiv Detail & Related papers (2022-11-19T05:16:08Z)
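The embedding-space alignment described in the entry above can be pictured as a CLIP-style symmetric contrastive objective between a structure encoder and a property encoder; the sketch below uses placeholder linear projections and dimensions that are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Projects structure and property features into one shared embedding space."""
    def __init__(self, struct_dim=256, prop_dim=32, embed_dim=128):
        super().__init__()
        self.struct_proj = nn.Linear(struct_dim, embed_dim)
        self.prop_proj = nn.Linear(prop_dim, embed_dim)

    def forward(self, struct_feats, prop_feats):
        zs = F.normalize(self.struct_proj(struct_feats), dim=-1)
        zp = F.normalize(self.prop_proj(prop_feats), dim=-1)
        return zs, zp

def alignment_loss(zs, zp, tau=0.07):
    """Symmetric InfoNCE: matched structure/property rows are positives."""
    logits = zs @ zp.t() / tau
    target = torch.arange(len(zs))
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.t(), target)) / 2
```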
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
- Retrieval-based Controllable Molecule Generation [63.44583084888342]
We propose a new retrieval-based framework for controllable molecule generation.
We use a small set of molecules to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria.
Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning.
arXiv Detail & Related papers (2022-08-23T17:01:16Z)
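The retrieval step of such a framework might look like the following sketch, which scores a candidate pool by Tanimoto similarity on Morgan fingerprints (an assumed similarity criterion) and hands the exemplars to a frozen generator via a hypothetical `generator.sample` API:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def retrieve_exemplars(query_smiles: str, pool: list[str], k: int = 5) -> list[str]:
    """Return the k pool molecules most similar to the query (Tanimoto on ECFP4)."""
    def fp(s):
        return AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(s), radius=2, nBits=2048)
    qfp = fp(query_smiles)
    scored = [(DataStructs.TanimotoSimilarity(qfp, fp(s)), s) for s in pool]
    return [s for _, s in sorted(scored, reverse=True)[:k]]

# exemplars = retrieve_exemplars(seed_smiles, screening_library)
# new_mols = generator.sample(condition=exemplars)   # hypothetical generator API
```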
- A biologically-inspired evaluation of molecular generative machine learning [17.623886600638716]
A novel biologically-inspired benchmark for the evaluation of molecular generative models is proposed.
We propose a recreation metric and apply drug-target affinity prediction and molecular docking as complementary techniques for evaluating generative outputs.
arXiv Detail & Related papers (2022-08-20T11:01:10Z)
- Analysis of training and seed bias in small molecules generated with a conditional graph-based variational autoencoder -- Insights for practical AI-driven molecule generation [0.0]
We analyze the impact of seed and training bias on the output of an activity-conditioned graph-based variational autoencoder (VAE).
Our graph-based generative model is shown to excel in producing desired conditioned activities and favorable unconditioned physical properties in generated molecules.
arXiv Detail & Related papers (2021-07-19T16:00:05Z)
- Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning [1.7188280334580197]
Self-supervised representation learning of biological sequence embeddings alleviates computational resource constraints on downstream tasks.
We show that contrastive learning using evolutionary phylogenetic augmentation can be used as a representation learning objective.
arXiv Detail & Related papers (2020-12-25T01:35:06Z)
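As a rough illustration of phylogenetic augmentation, the sketch below samples positive pairs from shared clades instead of applying synthetic perturbations; the clade mapping and the downstream contrastive loss are assumed scaffolding, not the paper's code:

```python
import random

def sample_positive_pairs(clades: dict[str, list[str]], n_pairs: int):
    """Sample (anchor, positive) sequence pairs from within shared clades."""
    eligible = [c for c, seqs in clades.items() if len(seqs) >= 2]
    pairs = []
    for _ in range(n_pairs):
        clade = random.choice(eligible)
        anchor, positive = random.sample(clades[clade], 2)
        pairs.append((anchor, positive))   # evolution stands in for augmentation
    return pairs

# The pairs then feed a standard InfoNCE contrastive loss, with clade
# membership playing the role that synthetic augmentations usually play.
```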