BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery
- URL: http://arxiv.org/abs/2411.10548v1
- Date: Fri, 15 Nov 2024 19:46:16 GMT
- Title: BioNeMo Framework: a modular, high-performance library for AI model development in drug discovery
- Authors: Peter St. John, Dejun Lin, Polina Binder, Malcolm Greaves, Vega Shah, John St. John, Adrian Lange, Patrick Hsu, Rajesh Illango, Arvind Ramanathan, Anima Anandkumar, David H Brookes, Akosua Busia, Abhishaike Mahajan, Stephen Malina, Neha Prasad, Sam Sinai, Lindsay Edwards, Thomas Gaudelet, Cristian Regep, Martin Steinegger, Burkhard Rost, Alexander Brace, Kyle Hippe, Luca Naef, Keisuke Kamata, George Armstrong, Kevin Boyd, Zhonglin Cao, Han-Yi Chou, Simon Chu, Allan dos Santos Costa, Sajad Darabi, Eric Dawson, Kieran Didi, Cong Fu, Mario Geiger, Michelle Gill, Darren Hsu, Gagan Kaushik, Maria Korshunova, Steven Kothen-Hill, Youhan Lee, Meng Liu, Micha Livne, Zachary McClure, Jonathan Mitchell, Alireza Moradzadeh, Ohad Mosafi, Youssef Nashed, Saee Paliwal, Yuxing Peng, Sara Rabhi, Farhad Ramezanghorbani, Danny Reidenbach, Camir Ricketts, Brian Roland, Kushal Shah, Tyler Shimko, Hassan Sirelkhatim, Savitha Srinivasan, Abraham C Stern, Dorota Toczydlowska, Srimukh Prasad Veccham, Niccolò Alberto Elia Venanzi, Anton Vorontsov, Jared Wilber, Isabel Wilkinson, Wei Jing Wong, Eva Xue, Cory Ye, Xin Yu, Yang Zhang, Guoqing Zhou, Becca Zandstein, Christian Dallago, Bruno Trentini, Emine Kucukbenli, Saee Paliwal, Timur Rvachov, Eddie Calleja, Johnny Israeli, Harry Clifford, Risto Haukioja, Nicholas Haemel, Kyle Tretina, Neha Tadimeti, Anthony B Costa,
- Abstract summary: We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models.
On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days.
The BioNeMo Framework is open-source and free for everyone to use.
- Abstract: Artificial Intelligence models encoding biology and chemistry are opening new routes to high-throughput and high-quality in-silico drug development. However, their training increasingly relies on computational scale, with recent protein language models (pLM) training on hundreds of graphical processing units (GPUs). We introduce the BioNeMo Framework to facilitate the training of computational biology and chemistry AI models across hundreds of GPUs. Its modular design allows the integration of individual components, such as data loaders, into existing workflows and is open to community contributions. We detail technical features of the BioNeMo Framework through use cases such as pLM pre-training and fine-tuning. On 256 NVIDIA A100s, BioNeMo Framework trains a three billion parameter BERT-based pLM on over one trillion tokens in 4.2 days. The BioNeMo Framework is open-source and free for everyone to use.
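As a sanity check on the headline training claim, the implied token throughput can be computed directly from the numbers quoted in the abstract (1 trillion tokens, 256 GPUs, 4.2 days). This is back-of-the-envelope arithmetic, not a measurement reported by the paper:

```python
# Throughput implied by the abstract: 3B-parameter BERT-based pLM,
# "over one trillion tokens", 256 A100 GPUs, 4.2 days of training.
tokens = 1.0e12              # lower bound, since the paper says "over one trillion"
gpus = 256
days = 4.2
seconds = days * 24 * 3600   # 4.2 days = 362,880 s

agg_tokens_per_s = tokens / seconds           # aggregate throughput across all GPUs
per_gpu_tokens_per_s = agg_tokens_per_s / gpus

print(f"aggregate: {agg_tokens_per_s:,.0f} tokens/s")   # ~2.76 million tokens/s
print(f"per GPU:   {per_gpu_tokens_per_s:,.0f} tokens/s")  # ~10,800 tokens/s per A100
```

Because the token count is quoted as "over one trillion", both figures are lower bounds.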
Related papers
- Bio2Token: All-atom tokenization of any biomolecular structure with Mamba [3.039173168183899]
We develop quantized auto-encoders that learn atom-level tokenizations of complete proteins, RNA and small molecule structures with reconstruction accuracies below and around 1 Angstrom.
We demonstrate that the Mamba state space model architecture employed is comparatively efficient, requiring a fraction of the training data, parameters, and compute needed to reach competitive accuracies, and that it can scale to systems with almost 100,000 atoms.
arXiv Detail & Related papers (2024-10-24T19:23:09Z)
- All-in-one foundational models learning across quantum chemical levels [0.0]
We introduce the all-in-one (AIO) ANI model architecture based on multimodal learning.
Our all-in-one learning approach offers a more general and easier-to-use alternative to transfer learning.
We show that the AIO-ANI model can learn across different QC levels ranging from semi-empirical to density functional theory to coupled cluster.
arXiv Detail & Related papers (2024-09-18T14:29:14Z)
- GP-MoLFormer: A Foundation Model For Molecular Generation [31.569161097828893]
We extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work.
Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator that is trained on more than 1.1B chemical SMILES.
We find GP-MoLFormer is able to generate a significant fraction of novel, valid, and unique SMILES even when the number of generated molecules is in the 10 billion range and the reference set is over a billion.
arXiv Detail & Related papers (2024-04-04T16:20:06Z)
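The validity, uniqueness, and novelty figures quoted in the GP-MoLFormer summary above are standard molecular-generation metrics. Below is a minimal sketch of how such metrics are commonly computed over generated SMILES, assuming RDKit for validity checking; the helper names, reference set, and generated strings are illustrative placeholders, not the paper's data or code:

```python
from rdkit import Chem  # RDKit: standard cheminformatics toolkit

def canonical(smiles: str):
    """Return canonical SMILES if the string parses, else None (invalid)."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def generation_metrics(generated, reference):
    """Validity, uniqueness, and novelty of generated SMILES strings."""
    canon = [canonical(s) for s in generated]
    valid = [s for s in canon if s is not None]
    unique = set(valid)
    novel = unique - set(reference)  # unique molecules not present in the reference set
    n = len(generated)
    return {
        "validity": len(valid) / n,
        "uniqueness": len(unique) / max(len(valid), 1),
        "novelty": len(novel) / max(len(unique), 1),
    }

# Toy usage with placeholder molecules (not from the paper):
ref = {canonical(s) for s in ["CCO", "c1ccccc1"]}
gen = ["CCO", "CCN", "not_a_smiles", "CCN"]
print(generation_metrics(gen, ref))
```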
- MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki [46.62437145754009]
We present the MAMMOTH toolkit, a framework for training massively multilingual modular machine translation systems at scale.
We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future work.
arXiv Detail & Related papers (2024-03-12T11:32:30Z)
- OpenMEDLab: An Open-source Platform for Multi-modality Foundation Models in Medicine [55.29668193415034]
We present OpenMEDLab, an open-source platform for multi-modality foundation models.
It encapsulates solutions of pioneering attempts in prompting and fine-tuning large language and vision models for frontline clinical and bioinformatic applications.
It opens access to a group of pre-trained foundation models for various medical image modalities, clinical text, protein engineering, etc.
arXiv Detail & Related papers (2024-02-28T03:51:02Z)
- Model LEGO: Creating Models Like Disassembling and Assembling Building Blocks [53.09649785009528]
In this paper, we explore a paradigm that does not require training to obtain new models.
Inspired by the way receptive fields in the biological visual system gave rise to CNNs, we propose Model Disassembling and Assembling.
For model assembling, we present the alignment padding strategy and parameter scaling strategy to construct a new model tailored for a specific task.
arXiv Detail & Related papers (2022-03-25T05:27:28Z)
- The Ecological Footprint of Neural Machine Translation Systems [2.132096006921048]
This chapter focuses on the ecological footprint of neural MT systems.
It starts from the power drain during training of and inference with neural MT models and moves on to their environmental impact.
The overall CO2 emissions are calculated for Ireland and the Netherlands.
arXiv Detail & Related papers (2022-02-04T14:56:41Z)
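The kind of estimate the ecological-footprint chapter above performs can be illustrated with a simple energy-to-CO2 conversion: energy consumed (kWh) multiplied by the grid's carbon intensity (kg CO2 per kWh). All numbers below are assumed placeholder values for the arithmetic, not figures reported in the chapter:

```python
# Illustrative training-footprint estimate. Every value here is an assumption
# chosen only to demonstrate the calculation, not data from the chapter.
gpu_power_kw = 0.3      # assumed average draw per GPU (~300 W)
num_gpus = 4
training_hours = 120    # assumed wall-clock training time
pue = 1.5               # assumed data-centre power usage effectiveness

energy_kwh = gpu_power_kw * num_gpus * training_hours * pue

# Assumed carbon intensities (kg CO2 / kWh) for two example grids.
intensity = {"grid_A": 0.35, "grid_B": 0.40}

for grid, kg_per_kwh in intensity.items():
    print(f"{grid}: {energy_kwh * kg_per_kwh:.1f} kg CO2 for {energy_kwh:.0f} kWh")
```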
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires an enormous amount of compute and a large memory footprint.
We propose a simple training strategy called "Pseudo-to-Real" for large models that require a high memory footprint.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study [62.376800537374024]
We study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction.
We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance.
arXiv Detail & Related papers (2021-06-17T17:55:33Z)
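A toy sketch of the routing idea described in the summary above: two link-prediction scorers (an LM-based model and a KG-embedding model, both stubbed out here) and a learned binary router that decides, per input triple, which model's score to use. The feature set, the logistic-regression router, and all names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stub scorers standing in for the two model families discussed in the paper.
def lm_score(triple):   # placeholder for an LM-based link predictor
    return np.random.rand()

def kge_score(triple):  # placeholder for a KG-embedding link predictor
    return np.random.rand()

def featurize(triple):
    """Toy per-example features the router conditions on (an assumption)."""
    head, rel, tail = triple
    return [len(head), len(rel), len(tail)]

# Train the router on examples labelled with which model should handle them
# (labels here are arbitrary toy values just to make the code run).
triples = [("aspirin", "treats", "headache"), ("TP53", "interacts_with", "MDM2")] * 8
labels = np.array([i % 2 for i in range(len(triples))])  # 1 = use LM, 0 = use KGE
router = LogisticRegression().fit([featurize(t) for t in triples], labels)

def routed_score(triple):
    """Route the example to one of the two scorers, as the paper's router does."""
    use_lm = router.predict([featurize(triple)])[0] == 1
    return lm_score(triple) if use_lm else kge_score(triple)

print(routed_score(("metformin", "treats", "type 2 diabetes")))
```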
This list is automatically generated from the titles and abstracts of the papers on this site.