nach0: Multimodal Natural and Chemical Languages Foundation Model
- URL: http://arxiv.org/abs/2311.12410v3
- Date: Thu, 2 May 2024 09:12:12 GMT
- Title: nach0: Multimodal Natural and Chemical Languages Foundation Model
- Authors: Micha Livne, Zulfat Miftahutdinov, Elena Tutubalina, Maksim Kuznetsov, Daniil Polykovskiy, Annika Brundyn, Aastha Jhunjhunwala, Anthony Costa, Alex Aliper, Alán Aspuru-Guzik, Alex Zhavoronkov
- Abstract summary: This paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks.
nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings.
- Score: 7.815497069231599
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
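The abstract describes instruction tuning, in which task-specific instructions over both textual and molecular (SMILES) inputs are used to fine-tune a multi-task encoder-decoder LLM. Below is a minimal sketch of how such instruction-formatted prompts could be passed to a T5-style encoder-decoder model via Hugging Face transformers; the checkpoint name, task wording, and prompt templates are illustrative assumptions, not details taken from the paper.

# A minimal sketch of instruction-style prompting for a multi-task
# encoder-decoder LLM, assuming a T5-style interface via Hugging Face
# transformers. The checkpoint ("t5-small") and the instructions below
# are placeholders, not nach0's actual weights or prompt templates.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # stand-in checkpoint for a generic seq2seq model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Each task is phrased as a natural-language instruction, with molecules
# written as SMILES strings inside the same text stream.
prompts = [
    "Answer the biomedical question: What enzyme does aspirin inhibit?",
    "Generate a molecule (SMILES) that is water-soluble and contains a benzene ring.",
    "Predict a property of the molecule CC(=O)OC1=CC=CC=C1C(=O)O: is it an aromatic compound?",
]

inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64)
for prompt, out in zip(prompts, outputs):
    print(prompt, "->", tokenizer.decode(out, skip_special_tokens=True))

In this sketch the same model handles question answering, molecular generation, and property prediction purely through the instruction text, which is the multi-task setup the abstract describes.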
Related papers
- Exploring the Benefits of Domain-Pretraining of Generative Large Language Models for Chemistry [5.4665365335928024]
We investigate the trade-offs of leveraging off-the-shelf versus more targeted foundation models for scientific domains.
In this work, we examine the benefits of in-domain pre-training for a given scientific domain, chemistry, and compare these to open-source, off-the-shelf models with zero-shot and few-shot prompting.
Our results show that in-domain base models not only perform reasonably well on in-domain tasks in a zero-shot setting, but that further adaptation through instruction fine-tuning also yields impressive performance on chemistry-specific tasks.
arXiv Detail & Related papers (2024-11-05T22:45:10Z) - MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language [0.24434823694833652]
MAMMAL is a versatile multi-task multi-align foundation model that learns from large-scale biological datasets.
We introduce a prompt syntax that supports a wide range of classification, regression, and generation tasks.
We evaluate the model on 11 diverse downstream tasks spanning different steps within a typical drug discovery pipeline.
arXiv Detail & Related papers (2024-10-28T20:45:52Z) - nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder [8.125243669908318]
nach0-pc is a model that combines a domain-specific encoder with a textual representation to handle the spatial arrangement of atoms effectively.
We introduce a novel pre-training scheme for molecular point clouds to distill knowledge from datasets of spatial molecular structures.
arXiv Detail & Related papers (2024-10-11T20:30:44Z) - Crossing New Frontiers: Knowledge-Augmented Large Language Model Prompting for Zero-Shot Text-Based De Novo Molecule Design [0.0]
Our study explores the use of knowledge-augmented prompting of large language models (LLMs) for the zero-shot text-conditional de novo molecular generation task.
Our framework proves effective, outperforming state-of-the-art (SOTA) baseline models on benchmark datasets.
arXiv Detail & Related papers (2024-08-18T11:37:19Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce TSMMG, a multi-constraint molecular generation large language model that plays the role of a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from a set of 'teachers'.
We experimentally show that TSMMG performs remarkably well at generating molecules that meet complex property requirements described in natural language.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through cross-modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z) - Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective [53.300288393173204]
Large Language Models (LLMs) have shown remarkable performance in various cross-modal tasks.
In this work, we propose MolReGPT, an In-Context Few-Shot Molecule Learning paradigm for molecule-caption translation; a rough prompt-construction sketch is given after this list.
We evaluate the effectiveness of MolReGPT on molecule-caption translation, including molecule understanding and text-based molecule generation.
arXiv Detail & Related papers (2023-06-11T08:16:25Z) - Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z) - Unifying Molecular and Textual Representations via Multi-task Language Modelling [11.474894472719543]
We propose the first multi-domain, multi-task language model that can solve a wide range of tasks in both the chemical and natural language domains.
Our model can handle chemical and natural language concurrently, without requiring expensive pre-training on single domains or task-specific models.
Our work suggests that such models can robustly and efficiently accelerate discovery in physical sciences.
arXiv Detail & Related papers (2023-01-29T23:56:45Z) - A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model that is pretrained on molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z) - Reinforced Molecular Optimization with Neighborhood-Controlled Grammars [63.84003497770347]
We propose MNCE-RL, a graph convolutional policy network for molecular optimization.
We extend the original neighborhood-controlled embedding grammars to make them applicable to molecular graph generation.
We show that our approach achieves state-of-the-art performance in a diverse range of molecular optimization tasks.
arXiv Detail & Related papers (2020-11-14T05:42:15Z)
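The MolReGPT entry above describes in-context few-shot learning for molecule-caption translation. As a rough illustration of that idea, the sketch below assembles a few-shot prompt from example SMILES-caption pairs; the example pairs, prompt wording, and the choice of examples are illustrative assumptions rather than that paper's exact protocol, which additionally retrieves similar molecules to select the demonstrations.

# A rough sketch of few-shot, in-context prompting for molecule captioning.
# The example pairs and prompt wording are illustrative assumptions only.
few_shot_pairs = [
    ("CC(=O)OC1=CC=CC=C1C(=O)O", "Aspirin, an acetylated salicylate used as an analgesic."),
    ("CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "Caffeine, a methylxanthine central nervous system stimulant."),
]

def build_caption_prompt(query_smiles: str) -> str:
    """Assemble a few-shot prompt asking an LLM to caption a molecule."""
    lines = ["Describe the following molecules given their SMILES strings."]
    for smiles, caption in few_shot_pairs:
        lines.append(f"SMILES: {smiles}\nCaption: {caption}")
    lines.append(f"SMILES: {query_smiles}\nCaption:")
    return "\n\n".join(lines)

print(build_caption_prompt("C1=CC=C(C=C1)C(=O)O"))  # benzoic acid as the query

The completed prompt would then be sent to a general-purpose LLM, which fills in the final caption from the in-context examples alone, without any task-specific fine-tuning.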