nach0: Multimodal Natural and Chemical Languages Foundation Model
- URL: http://arxiv.org/abs/2311.12410v3
- Date: Thu, 2 May 2024 09:12:12 GMT
- Title: nach0: Multimodal Natural and Chemical Languages Foundation Model
- Authors: Micha Livne, Zulfat Miftahutdinov, Elena Tutubalina, Maksim Kuznetsov, Daniil Polykovskiy, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa, Alex Aliper, Alán Aspuru-Guzik, Alex Zhavoronkov,
- Abstract summary: This paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks.
nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings.
- Score: 7.815497069231599
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
Related papers
- NatureLM: Deciphering the Language of Nature for Scientific Discovery [105.57567762153462]
Foundation models have revolutionized natural language processing and artificial intelligence.
We introduce Nature Language Model (briefly, NatureLM), a sequence-based science foundation model for scientific discovery.
arXiv Detail & Related papers (2025-02-11T13:08:03Z) - Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset.
This dataset can bridge the gap between large language models (LLMs) and complex biological sequences-related tasks.
We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z) - Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models [43.37148291436855]
We present a two-step framework PEIT to improve large language models for molecular-related tasks.
In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN.
In the second step, we fine-tune existing open-source LLMs with the synthesized data, the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks.
arXiv Detail & Related papers (2024-12-24T01:48:07Z) - Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry [5.4665365335928024]
We investigate the trade-offs of leveraging off-the-shelf versus more targeted foundation models for scientific domains.
In this work, we examine the benefits of in-domain pre-training for a given scientific domain, chemistry, and compare these to open-source, off-the-shelf models with zero-shot and few-shot prompting.
Our results show that not only do in-domain base models perform reasonably well on in-domain tasks in a zero-shot setting but that further adaptation using instruction fine-tuning yields impressive performance on chemistry-specific tasks.
arXiv Detail & Related papers (2024-11-05T22:45:10Z) - Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based De Novo Molecule Design [0.0]
Our study explores the use of knowledge-augmented prompting of large language models (LLMs) for the zero-shot text-conditional de novo molecular generation task.
Our framework proves effective, outperforming state-of-the-art (SOTA) baseline models on benchmark datasets.
arXiv Detail & Related papers (2024-08-18T11:37:19Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'
We experimentally show that TSMMG remarkably performs in generating molecules meeting complex, natural language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Leveraging Biomolecule and Natural Language through Multi-Modal
Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose an In-context Few-Shot Molecule Learning paradigm for molecule-caption translation.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - Unifying Molecular and Textual Representations via Multi-task Language
Modelling [11.474894472719543]
We propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains.
Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models.
Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences.
arXiv Detail & Related papers (2023-01-29T23:56:45Z) - A Molecular Multimodal Foundation Model Associating Molecule Graphs with
Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.