MMAI Gym for Science: Training Liquid Foundation Models for Drug Discovery
- URL: http://arxiv.org/abs/2603.03517v1
- Date: Tue, 03 Mar 2026 20:51:51 GMT
- Title: MMAI Gym for Science: Training Liquid Foundation Models for Drug Discovery
- Authors: Maksim Kuznetsov, Zulfat Miftahutdinov, Rim Shayakhmetov, Mikolaj Mizera, Roman Schutski, Bogdan Zagribelnyy, Ivan Ilin, Nikita Bondarev, Thomas MacDougall, Mathieu Reymond, Mihir Bafna, Kaeli Kaymak-Loveless, Eugene Babin, Maxim Malkov, Mathias Lechner, Ramin Hasani, Alexander Amini, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov
- Abstract summary: MMAI Gym is a one-stop shop for molecular data formats and modalities, as well as task-specific reasoning, training, and benchmarking recipes.
We use MMAI Gym to train an efficient Liquid Foundation Model (LFM) for these applications, demonstrating that smaller, purpose-trained foundation models can outperform substantially larger general-purpose or specialist models on molecular benchmarks.
- Score: 41.21168385964764
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: General-purpose large language models (LLMs) that rely on in-context learning do not reliably deliver the scientific understanding and performance required for drug discovery tasks. Simply increasing model size or introducing reasoning tokens does not yield significant performance gains. To address this gap, we introduce the MMAI Gym for Science, a one-stop shop for molecular data formats and modalities, as well as task-specific reasoning, training, and benchmarking recipes designed to teach foundation models the 'language of molecules' in order to solve practical drug discovery problems. We use MMAI Gym to train an efficient Liquid Foundation Model (LFM) for these applications, demonstrating that smaller, purpose-trained foundation models can outperform substantially larger general-purpose or specialist models on molecular benchmarks. Across essential drug discovery tasks - including molecular optimization, ADMET property prediction, retrosynthesis, drug-target activity prediction, and functional group reasoning - the resulting model achieves near specialist-level performance and, in the majority of settings, surpasses larger models, while remaining more efficient and broadly applicable in the domain.
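To make the abstract's "task recipe" framing concrete, here is a minimal, hypothetical sketch of how one benchmark item (ADMET property prediction posed as a text task) could be formatted and scored against any text-in/text-out model. The dataclass, prompt template, and tolerance-based scoring below are illustrative assumptions, not the actual MMAI Gym API:

```python
# Hypothetical benchmark-item sketch; names and scoring are assumptions,
# not the MMAI Gym interface described in the paper.
from dataclasses import dataclass

@dataclass
class ADMETItem:
    smiles: str          # molecule as a SMILES string
    property_name: str   # e.g. "logP" or "solubility"
    label: float         # experimental ground-truth value

PROMPT_TEMPLATE = (
    "You are a medicinal chemistry assistant.\n"
    "Molecule (SMILES): {smiles}\n"
    "Predict the {prop} value. Answer with a single number."
)

def score_item(model_fn, item: ADMETItem, tolerance: float = 0.5) -> bool:
    """Format the item as a prompt, query the model, and check the answer.

    `model_fn` stands in for any text-in/text-out foundation model.
    """
    prompt = PROMPT_TEMPLATE.format(smiles=item.smiles, prop=item.property_name)
    try:
        prediction = float(model_fn(prompt).strip())
    except ValueError:
        return False  # unparseable answers count as failures
    return abs(prediction - item.label) <= tolerance

# Usage with a trivial stand-in model that always answers "2.0":
item = ADMETItem(smiles="CCO", property_name="logP", label=-0.31)
print(score_item(lambda prompt: "2.0", item))  # False: off by more than 0.5
```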
Related papers
- Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis [51.83339196548892]
ChemCraft is a novel framework that decouples chemical reasoning from knowledge storage.
ChemCraft achieves superior performance with minimal inference costs.
This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry.
arXiv Detail & Related papers (2026-01-25T04:23:34Z)
- BioMedGPT-Mol: Multi-task Learning for Molecular Understanding and Generation [9.078742514163524]
We introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks.
By curating and unifying existing public instruction datasets, we have assembled a large-scale, comprehensive, and high-quality training dataset.
The model is then fine-tuned through a meticulously designed multi-task learning framework.
arXiv Detail & Related papers (2025-12-04T10:00:16Z)
- Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets.
We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
- Reasoning-Enhanced Large Language Models for Molecular Property Prediction [19.593493317167646]
Molecular property prediction is crucial for drug discovery and materials science.
Existing approaches suffer from limited interpretability, poor cross-task generalization, and a lack of chemical reasoning capabilities.
We propose MPPReasoner, a multimodal large language model that incorporates chemical reasoning for molecular property prediction.
arXiv Detail & Related papers (2025-10-11T15:05:45Z)
- NovoMolGen: Rethinking Molecular Language Model Pretraining [14.403924658046806]
We introduce NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de novo molecule generation.
Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance.
NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks.
arXiv Detail & Related papers (2025-08-19T00:04:48Z)
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view.
Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- ExLLM: Experience-Enhanced LLM Optimization for Molecular Design and Beyond [16.374785306736474]
We introduce ExLLM (Experience-Enhanced LLM optimization), an LLM-as-optimizer framework with three components.
ExLLM sets new state-of-the-art results on PMO and generalizes strongly in our setup.
arXiv Detail & Related papers (2025-02-18T13:25:00Z)
- Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities (a minimal sketch of span masking over SMILES tokens appears after this list).
arXiv Detail & Related papers (2024-11-03T01:56:15Z)
- The Role of Model Architecture and Scale in Predicting Molecular Properties: Insights from Fine-Tuning RoBERTa, BART, and LLaMA [0.0]
This study introduces a systematic framework to compare the efficacy of Large Language Models (LLMs) for fine-tuning across various cheminformatics tasks.
We assessed three well-known models-RoBERTa, BART, and LLaMA-on their ability to predict molecular properties.
We found that LLaMA-based models generally offered the lowest validation loss, suggesting their superior adaptability across tasks and scales.
arXiv Detail & Related papers (2024-05-02T02:20:12Z)
- MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures [2.5563339057415218]
MolIG is a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures.
It amalgamates the strengths of both molecular representation forms.
It exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups.
arXiv Detail & Related papers (2023-11-28T10:28:35Z)
- MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular Representation Learning [77.31492888819935]
We propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT).
MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt.
Experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction (a toy sketch of the motif-prompting idea appears after this list).
arXiv Detail & Related papers (2022-12-20T19:32:30Z)
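As noted in the Random Functional Group Masking entry above, here is a minimal, self-contained sketch of masking a span of SMILES tokens. The regex tokenizer is a common convention for SMILES, while the fixed-length span selection is an illustrative simplification; the paper's actual procedure targets subsequences tied to specific atoms or functional groups:

```python
# Toy SMILES span masking; the fragment-selection heuristic is an assumption,
# not the paper's exact masking strategy.
import random
import re

# Tokenize SMILES into chemically meaningful units (bracket atoms,
# two-letter elements, single atoms, bonds, ring closures, branches).
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|@@|[BCNOPSFIbcnops]|[=#$/\\%().+\-:~]|[0-9])"
)

def mask_smiles(smiles, mask_token="[MASK]", span=3, seed=None):
    """Replace one random contiguous token subsequence with a mask token.

    The masked span stands in for a functional-group-sized fragment; a real
    implementation would pick spans aligned with chemical substructures.
    """
    rng = random.Random(seed)
    tokens = SMILES_TOKEN_RE.findall(smiles)
    if len(tokens) <= span:
        return mask_token
    start = rng.randrange(len(tokens) - span)
    return "".join(tokens[:start]) + mask_token + "".join(tokens[start + span:])

# Example: mask a 3-token span of aspirin's SMILES.
print(mask_smiles("CC(=O)Oc1ccccc1C(=O)O", seed=7))
```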
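And for the MolCPT entry, a toy sketch of the motif-prompting idea: learnable embeddings of detected motifs are pooled into a continuous prompt and fused with a (frozen) pre-trained encoder's molecule embedding. The motif vocabulary, mean pooling, and dimensions are illustrative assumptions, not MolCPT's actual architecture:

```python
# Toy continuous motif prompting; dimensions and fusion are assumptions,
# not the MolCPT architecture.
import torch
import torch.nn as nn

class MotifPrompter(nn.Module):
    """Pool learnable motif embeddings into a continuous prompt and fuse it
    with a pre-trained molecule embedding."""

    def __init__(self, num_motifs=16, dim=32):
        super().__init__()
        self.motif_embed = nn.Embedding(num_motifs, dim)  # trainable prompts
        self.fuse = nn.Linear(2 * dim, dim)               # molecule + prompt

    def forward(self, mol_emb, motif_ids):
        # mol_emb: (batch, dim) from a frozen pre-trained GNN/encoder
        # motif_ids: (batch, n_motifs) indices of motifs found in each molecule
        prompt = self.motif_embed(motif_ids).mean(dim=1)  # pool motif prompts
        return self.fuse(torch.cat([mol_emb, prompt], dim=-1))

# Usage with random stand-ins for the encoder output and detected motifs:
prompter = MotifPrompter()
mol_emb = torch.randn(4, 32)               # pretend encoder embeddings
motif_ids = torch.randint(0, 16, (4, 3))   # 3 detected motifs per molecule
print(prompter(mol_emb, motif_ids).shape)  # torch.Size([4, 32])
```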
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.