BioMedGPT-Mol: Multi-task Learning for Molecular Understanding and Generation
- URL: http://arxiv.org/abs/2512.04629v2
- Date: Thu, 11 Dec 2025 08:24:32 GMT
- Title: BioMedGPT-Mol: Multi-task Learning for Molecular Understanding and Generation
- Authors: Chenyang Zuo, Siqi Fan, Zaiqing Nie
- Abstract summary: We introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks.
By curating and unifying existing public instruction datasets, we have assembled a large-scale, comprehensive, and high-quality training dataset.
The model is then fine-tuned through a meticulously designed multi-task learning framework.
- Score: 9.078742514163524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Molecules play a crucial role in biomedical research and discovery, particularly in the field of small molecule drug development. Given the rapid advancements in large language models, especially the recent emergence of reasoning models, it is natural to explore how a general-purpose language model can be efficiently adapted for molecular science applications. In this work, we introduce BioMedGPT-Mol, a molecular language model designed to support molecular understanding and generation tasks. By curating and unifying existing public instruction datasets, we have assembled a large-scale, comprehensive, and high-quality training dataset. The model is then fine-tuned through a meticulously designed multi-task learning framework. On a consolidated benchmark derived from LlaSMol, TOMG-Bench, and MuMOInstruct, BioMedGPT-Mol achieves remarkable performance. Our experimental results demonstrate that a general-purpose reasoning model can be effectively and efficiently post-trained into a professional molecular language model through a well-structured multi-task curriculum. Leveraging these capabilities, we further apply the model to multi-step retrosynthetic planning, achieving state-of-the-art performance on RetroBench and demonstrating its superior efficacy as an end-to-end retrosynthetic planner. We anticipate that our approach can be extended to other biomedical scientific domains.
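As a concrete illustration of the dataset-unification step described above, the sketch below normalizes heterogeneous instruction records into one shared schema and draws task-weighted batches for multi-task fine-tuning. This is a minimal sketch, not the paper's pipeline: the field names (`task`, `instruction`, `input`, `output`), the example tasks, and the sampling weights are assumptions, since the abstract does not specify the actual schema or task mixture.

```python
import json
import random

# Hypothetical unified record: every source dataset is mapped onto the
# same four fields before multi-task fine-tuning (schema is assumed).
def normalize(task: str, instruction: str, molecule: str, answer: str) -> dict:
    return {"task": task, "instruction": instruction,
            "input": molecule, "output": answer}

# Toy records standing in for examples drawn from public instruction sets.
records = [
    normalize("name_conversion", "Convert this SMILES to an IUPAC name.",
              "CCO", "ethanol"),
    normalize("property_prediction", "Is this molecule water-soluble?",
              "CCO", "Yes"),
    normalize("molecule_generation",
              "Generate a molecule containing a carboxyl group.",
              "", "CC(=O)O"),
]

# Assumed per-task weights; the paper's actual curriculum is not
# specified in the abstract.
weights = {"name_conversion": 1.0, "property_prediction": 1.0,
           "molecule_generation": 2.0}

def sample_batch(records: list, weights: dict, k: int = 2, seed: int = 0) -> list:
    """Draw a task-weighted batch for one multi-task training step."""
    rng = random.Random(seed)
    probs = [weights[r["task"]] for r in records]
    return rng.choices(records, weights=probs, k=k)

for record in sample_batch(records, weights):
    print(json.dumps(record))
```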
Related papers
- MMAI Gym for Science: Training Liquid Foundation Models for Drug Discovery [41.21168385964764]
MMAI Gym is a one-stop shop for molecular data formats and modalities, as well as task-specific reasoning, training, and benchmarking recipes.
We use MMAI Gym to train an efficient Liquid Foundation Model (LFM) for these applications, demonstrating that smaller, purpose-trained foundation models can outperform substantially larger general-purpose or specialist models on molecular benchmarks.
arXiv Detail & Related papers (2026-03-03T20:51:51Z)
- Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis [51.83339196548892]
ChemCraft is a novel framework that decouples chemical reasoning from knowledge storage.
It achieves superior performance with minimal inference costs.
This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry.
arXiv Detail & Related papers (2026-01-25T04:23:34Z)
- NovoMolGen: Rethinking Molecular Language Model Pretraining [14.403924658046806]
We introduce NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de novo molecule generation.
Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance.
NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks.
arXiv Detail & Related papers (2025-08-19T00:04:48Z)
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view.
Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation [4.402280157389038]
We propose PharMolixFM, a unified framework for constructing all-atom foundation models.
Our framework includes three variants using state-of-the-art multi-modal generative models.
PharMolixFM-Diff achieves competitive prediction accuracy in protein-small-molecule docking.
arXiv Detail & Related papers (2025-03-12T12:53:43Z)
- ExLLM: Experience-Enhanced LLM Optimization for Molecular Design and Beyond [16.374785306736474]
We introduce ExLLM (Experience-Enhanced LLM optimization), an LLM-as-optimizer framework with three components.
ExLLM sets new state-of-the-art results on PMO and generalizes strongly in our setup.
arXiv Detail & Related papers (2025-02-18T13:25:00Z)
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, which acts as a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from its 'teachers'.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex, natural-language-described property requirements.
arXiv Detail & Related papers (2024-03-20T02:15:55Z)
- Leveraging Biomolecule and Natural Language through Multi-Modal Learning: A Survey [75.47055414002571]
The integration of biomolecular modeling with natural language (BL) has emerged as a promising interdisciplinary area at the intersection of artificial intelligence, chemistry and biology.
We provide an analysis of recent advancements achieved through the cross-modeling of biomolecules and natural language.
arXiv Detail & Related papers (2024-03-03T14:59:47Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical ontology OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping compute requirements low (a generic adapter sketch appears after this list).
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models [44.41299105569085]
Mol-Instructions is a comprehensive instruction dataset designed for the biomolecular domain.
Each component aims to improve the understanding and prediction capabilities of LLMs concerning biomolecular features and behaviors.
We demonstrate the effectiveness of Mol-Instructions in enhancing large models' performance in the intricate realm of biomolecular studies.
arXiv Detail & Related papers (2023-06-13T14:35:34Z)
- Domain-Agnostic Molecular Generation with Chemical Feedback [44.063584808910896]
MolGen is a pre-trained molecular language model tailored specifically for molecule generation.
It internalizes structural and grammatical insights through the reconstruction of over 100 million molecular SELFIES (see the SELFIES round-trip sketch after this list).
Our chemical feedback paradigm steers the model away from molecular hallucinations, ensuring alignment between the model's estimated probabilities and real-world chemical preferences.
arXiv Detail & Related papers (2023-01-26T17:52:56Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model pretrained on molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
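The adapter-based knowledge-injection entry above ("Diversifying Knowledge Enhancement...") inserts small trainable modules into a frozen pre-trained LM. That paper's exact adapter design is not described here, so the following is a generic bottleneck-adapter sketch in PyTorch; the hidden size, bottleneck width, and placement are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """A generic down-project / up-project adapter with a residual
    connection; only these small layers are trained while the host
    pre-trained LM stays frozen (sizes and placement are assumptions)."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual keeps the frozen model's behaviour as the starting
        # point; the adapter learns only a small correction on top of it.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# Toy usage: adapt the hidden states of one (frozen) transformer layer.
adapter = BottleneckAdapter()
hidden = torch.randn(2, 16, 768)  # (batch, sequence, hidden)
print(adapter(hidden).shape)      # torch.Size([2, 16, 768])
```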
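The MolGen entry above reports pretraining by reconstructing molecular SELFIES. The snippet below is a minimal sketch of the SELFIES round-trip property that motivates this choice, using the open-source `selfies` package; it is not MolGen's training code, and the example molecule is arbitrary.

```python
# Minimal illustration of the SELFIES round trip (not MolGen's code).
# Requires `pip install selfies`; the example molecule is arbitrary.
import selfies as sf

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin
encoded = sf.encoder(smiles)   # SMILES -> SELFIES
decoded = sf.decoder(encoded)  # SELFIES -> SMILES

print("SELFIES:", encoded)
print("Round-trip SMILES:", decoded)

# Even a truncated SELFIES token sequence decodes to some valid
# molecule, which is why reconstruction over SELFIES avoids the
# invalid-output failure mode of raw SMILES generation.
tokens = list(sf.split_selfies(encoded))
print("Truncated decode:", sf.decoder("".join(tokens[:5])))
```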