NatureLM: Deciphering the Language of Nature for Scientific Discovery
- URL: http://arxiv.org/abs/2502.07527v1
- Date: Tue, 11 Feb 2025 13:08:03 GMT
- Title: NatureLM: Deciphering the Language of Nature for Scientific Discovery
- Authors: Yingce Xia, Peiran Jin, Shufang Xie, Liang He, Chuan Cao, Renqian Luo, Guoqing Liu, Yue Wang, Zequn Liu, Yuan-Jyue Chen, Zekun Guo, Yeqi Bai, Pan Deng, Yaosen Min, Ziheng Lu, Hongxia Hao, Han Yang, Jielan Li, Chang Liu, Jia Zhang, Jianwei Zhu, Kehan Wu, Wei Zhang, Kaiyuan Gao, Qizhi Pei, Qian Wang, Xixian Liu, Yanting Li, Houtian Zhu, Yeqing Lu, Mingqian Ma, Zun Wang, Tian Xie, Krzysztof Maziarz, Marwin Segler, Zhao Yang, Zilong Chen, Yu Shi, Shuxin Zheng, Lijun Wu, Chen Hu, Peggy Dai, Tie-Yan Liu, Haiguang Liu, Tao Qin,
- Abstract summary: Foundation models have revolutionized natural language processing and artificial intelligence.
We introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model for scientific discovery.
- Score: 105.57567762153462
- License:
- Abstract: Foundation models have revolutionized natural language processing and artificial intelligence, significantly enhancing how machines comprehend and generate human languages. Inspired by the success of these foundation models, researchers have developed foundation models for individual scientific domains, including small molecules, materials, proteins, DNA, and RNA. However, these models are typically trained in isolation, lacking the ability to integrate across different scientific domains. Recognizing that entities within these domains can all be represented as sequences, which together form the "language of nature", we introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model designed for scientific discovery. Pre-trained with data from multiple scientific domains, NatureLM offers a unified, versatile model that enables various applications including: (i) generating and optimizing small molecules, proteins, RNA, and materials using text instructions; (ii) cross-domain generation/design, such as protein-to-molecule and protein-to-RNA generation; and (iii) achieving state-of-the-art performance in tasks like SMILES-to-IUPAC translation and retrosynthesis on USPTO-50k. NatureLM offers a promising generalist approach for various scientific tasks, including drug discovery (hit generation/optimization, ADMET optimization, synthesis), novel material design, and the development of therapeutic proteins or nucleotides. We have developed NatureLM models in different sizes (1 billion, 8 billion, and 46.7 billion parameters) and observed a clear improvement in performance as the model size increases.
Related papers
- MOLLM: Multi-Objective Large Language Model for Molecular Design -- Optimizing with Experts [3.9194654197529784]
Molecular design plays a critical role in advancing fields such as drug discovery, materials science, and chemical engineering.
This work introduces the Multi-Objective Large Language Model for Molecular Design (MOLLM), a novel framework that combines domain-specific knowledge with the adaptability of Large Language Models.
arXiv Detail & Related papers (2025-02-18T13:25:00Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - InstructBioMol: Advancing Biomolecule Understanding and Design Following Human Instructions [32.38318676313486]
InstructBioMol is designed to bridge natural language and biomolecules.
It can integrate multimodal biomolecules as input, and enable researchers to articulate design goals in natural language.
It can generate drug molecules with a 10% improvement in binding affinity and design enzymes that achieve an ESP Score of 70.4.
arXiv Detail & Related papers (2024-10-10T13:45:56Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'
We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Leveraging Biomolecule and Natural Language through Multi-Modal
Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z) - nach0: Multimodal Natural and Chemical Languages Foundation Model [7.815497069231599]
This paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks.
nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings.
arXiv Detail & Related papers (2023-11-21T07:56:30Z) - DARWIN Series: Domain Specific Large Language Models for Natural Science [20.864698325126735]
We present DARWIN, a series of tailored LLMs for natural science, mainly in physics, chemistry, and material science.
We fine-tuned the models using over 60,000 instruction data points, emphasizing factual correctness.
DARWIN series not only achieves state-of-the-art results on various scientific tasks but also diminishes reliance on closed-source AI models.
arXiv Detail & Related papers (2023-08-25T01:40:48Z) - Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z) - A Molecular Multimodal Foundation Model Associating Molecule Graphs with
Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.