Towards Foundational Models for Molecular Learning on Large-Scale
Multi-Task Datasets
- URL: http://arxiv.org/abs/2310.04292v3
- Date: Wed, 18 Oct 2023 11:06:43 GMT
- Authors: Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Zhiyi Li, Gabriela
Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean,
Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael
Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini,
Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume
Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, Ioannis Koutis,
Mirco Ravanelli, Guy Wolf, Prudencio Tossou, Hadrien Mary, Therence Bois,
Andrew Fitzgibbon, Błażej Banaszewski, Chad Martin, Dominic Masters
- Abstract summary: We present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge.
These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning.
In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recently, pre-trained foundation models have enabled significant advancements
in multiple fields. In molecular machine learning, however, datasets are often
hand-curated and hence typically small; the lack of datasets with labeled
features, and of codebases to manage those datasets, has hindered the
development of foundation models. In this work, we present seven novel datasets
categorized by size into three distinct categories: ToyMix, LargeMix and
UltraLarge. These datasets push the boundaries in both the scale and the
diversity of supervised labels for molecular learning. They cover nearly 100
million molecules and over 3000 sparsely defined tasks, totaling more than 13
billion individual labels of both quantum and biological nature. In comparison,
our datasets contain 300 times more data points than the widely used OGB-LSC
PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In
addition, to support the development of foundational models based on our
proposed datasets, we present the Graphium graph machine learning library which
simplifies the process of building and training molecular machine learning
models for multi-task and multi-level molecular datasets. Finally, we present a
range of baseline results as a starting point for multi-task and multi-level
training on these datasets. Empirically, we observe that performance on
low-resource biological datasets improves when we also train on large
amounts of quantum data. This indicates that there may be potential in
multi-task and multi-level training of a foundation model and fine-tuning it to
resource-constrained downstream tasks.
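
To make the multi-task setup concrete: with over 3000 sparsely defined tasks, a given molecule typically carries labels for only a small subset of them, so the training loss must be masked to the labels that actually exist. Below is a minimal sketch in plain PyTorch, not Graphium's actual API; the encoder, the NaN-for-missing-label convention, and all names are illustrative assumptions.

import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    # Shared molecular encoder with one linear output per task.
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_tasks: int):
        super().__init__()
        self.encoder = encoder  # hypothetical GNN producing graph embeddings
        self.heads = nn.Linear(hidden_dim, num_tasks)

    def forward(self, batch) -> torch.Tensor:
        h = self.encoder(batch)   # (batch_size, hidden_dim)
        return self.heads(h)      # (batch_size, num_tasks)

def masked_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Mean squared error restricted to present labels; NaN marks a missing one.
    mask = ~torch.isnan(target)
    if not mask.any():
        return pred.sum() * 0.0   # zero loss that stays on the autograd graph
    return ((pred[mask] - target[mask]) ** 2).mean()

Under this scheme, dense quantum labels and sparse biological labels update the same shared encoder, which is one way to read the observation above that large amounts of quantum data improve performance on low-resource biological tasks.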
Related papers
- Physical Consistency Bridges Heterogeneous Data in Molecular Multi-Task Learning
We exploit the fact that physical laws connect different molecular tasks, and design consistency training approaches around them.
We demonstrate that the more accurate energy data can improve the accuracy of structure prediction.
We also find that consistency training can directly leverage force and off-equilibrium structure data to improve structure prediction.
arXiv Detail & Related papers (2024-10-14T03:11:33Z)
- A Large Encoder-Decoder Family of Foundation Models For Chemical Language
This paper introduces a family of large encoder-decoder chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem.
Our experiments across multiple benchmark datasets validate the capacity of the proposed model in providing state-of-the-art results for different tasks.
arXiv Detail & Related papers (2024-07-24T20:30:39Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z)
- Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find that MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)
- Multi-Task Self-Training for Learning General Representations
Multi-task self-training (MuST) harnesses the knowledge in independent specialized teacher models to train a single general student model.
MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets.
arXiv Detail & Related papers (2021-08-25T17:20:50Z)
- OmiEmbed: reconstruct comprehensive phenotypic information from multi-omics data using multi-task deep learning
High-dimensional omics data contains intrinsic biomedical information crucial for personalised medicine.
It is challenging to capture this information from genome-wide data due to the large number of molecular features and the small number of available samples.
We proposed a unified multi-task deep learning framework called OmiEmbed to capture a holistic and relatively precise profile of phenotype from high-dimensional omics data.
arXiv Detail & Related papers (2021-02-03T07:34:29Z)
- MELINDA: A Multimodal Dataset for Biomedical Experiment Method Classification
We introduce a new dataset, MELINDA, for Multimodal biomEdicaL experImeNt methoD clAssification.
The dataset is collected in a fully automated distant supervision manner, where the labels are obtained from an existing curated database.
We benchmark various state-of-the-art NLP and computer vision models, including unimodal models which only take either caption texts or images as inputs.
arXiv Detail & Related papers (2020-12-16T19:11:36Z)
- Polymer Informatics with Multi-Task Learning
We show the potency of multi-task learning approaches that exploit inherent correlations effectively.
Data pertaining to 36 different properties of over 13,000 polymers are coalesced and supplied to deep-learning multi-task architectures.
The multi-task approach is accurate, efficient, scalable, and amenable to transfer learning as more data on the same or different properties become available.
arXiv Detail & Related papers (2020-10-28T18:28:12Z)