KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
- URL: http://arxiv.org/abs/2510.19484v1
- Date: Wed, 22 Oct 2025 11:23:58 GMT
- Title: KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
- Authors: Zaifei Yang, Hong Chang, Ruibing Hou, Shiguang Shan, Xilin Chen
- Abstract summary: We introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels. We also propose chemically-informative molecular representation, effectively addressing limitations in existing molecular representation strategies. KnowMol achieves superior performance across molecular understanding and generation tasks.
- Score: 73.51130155601824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The molecular large language models have garnered widespread attention due to their promising potential on molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose chemically-informative molecular representation, effectively addressing limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks. GitHub: https://github.com/yzf-code/KnowMol Huggingface: https://hf.co/datasets/yzf1102/KnowMol-100K
Related papers
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model [52.84455878597969]
Mol-LLaMA is a large molecular language model that grasps the general knowledge centered on molecules. To improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders.
arXiv Detail & Related papers (2025-02-19T05:49:10Z)
- UniMoT: Unified Molecule-Text Language Model with Discrete Token Representation [35.277927005912275]
We introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture. A Vector Quantization-driven tokenizer transforms molecules into sequences of molecule tokens with causal dependency. UniMoT emerges as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks.
arXiv Detail & Related papers (2024-08-01T18:31:31Z)
- Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex property requirements described in natural language.
arXiv Detail & Related papers (2024-03-20T02:15:55Z)
- GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text [25.979382232281786]
We introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information.
We achieve a 5%-10% accuracy increase in properties prediction and a 20.2% boost in molecule generation validity.
arXiv Detail & Related papers (2023-08-14T03:12:29Z)
- Interactive Molecular Discovery with Natural Language [69.89287960545903]
We propose the conversational molecular design, a novel task adopting natural language for describing and editing target molecules.
To better accomplish this task, we design ChatMol, a knowledgeable and versatile generative pre-trained model, enhanced by injecting experimental property information.
arXiv Detail & Related papers (2023-06-21T02:05:48Z)
- Domain-Agnostic Molecular Generation with Chemical Feedback [44.063584808910896]
MolGen is a pre-trained molecular language model tailored specifically for molecule generation.
It internalizes structural and grammatical insights through the reconstruction of over 100 million molecular SELFIES.
Our chemical feedback paradigm steers the model away from molecular hallucinations, ensuring alignment between the model's estimated probabilities and real-world chemical preferences.
arXiv Detail & Related papers (2023-01-26T17:52:56Z)
- A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language [63.60376252491507]
We propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data.
We believe that our model would have a broad impact on AI-empowered fields across disciplines such as biology, chemistry, materials, environment, and medicine.
arXiv Detail & Related papers (2022-09-12T00:56:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.