From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation
- URL: http://arxiv.org/abs/2601.21964v2
- Date: Fri, 30 Jan 2026 05:42:43 GMT
- Title: From Tokens to Blocks: A Block-Diffusion Perspective on Molecular Generation
- Authors: Qianwei Yang, Dong Xu, Zhangfan Yang, Sisi Yuan, Zexuan Zhu, Jianqiang Li, Junkai Ji,
- Abstract summary: GPT-based molecular language models (MLMs) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware generation. SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency.
- Score: 17.14830371749135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Drug discovery can be viewed as a combinatorial search over an immense chemical space, motivating the development of deep generative models for de novo molecular design. Among these, GPT-based molecular language models (MLMs) have shown strong molecular design performance by learning chemical syntax and semantics from large-scale data. However, existing MLMs face two fundamental limitations: they inadequately capture the graph-structured nature of molecules when formulated as next-token prediction problems, and they typically lack explicit mechanisms for target-aware generation. Here, we propose SoftMol, a unified framework that co-designs molecular representation, model architecture, and search strategy for target-aware molecular generation. SoftMol introduces soft fragments, a rule-free block representation of SMILES that enables diffusion-native modeling, and develops SoftBD, the first block-diffusion molecular language model that combines local bidirectional diffusion with autoregressive generation under molecular structural constraints. To bias generation toward molecules with high drug-likeness and synthetic accessibility, SoftBD is trained on a carefully curated dataset named ZINC-Curated. SoftMol further integrates a gated Monte Carlo tree search to assemble fragments in a target-aware manner. Experimental results show that, compared with current state-of-the-art models, SoftMol achieves 100% chemical validity, improves binding affinity by 9.7%, yields a 2-3x increase in molecular diversity, and delivers a 6.6x speedup in inference efficiency. Code is available at https://github.com/szu-aicourse/softmol
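The core decoding scheme described in the abstract (autoregressive generation over blocks, bidirectional diffusion within each block) can be sketched as follows. This is a minimal illustrative sketch, not SoftMol's actual implementation: `denoise` is a random stand-in for a learned masked-token denoiser, and the toy vocabulary, function names, and re-masking schedule are all assumptions.

```python
import random

# Toy SMILES-like vocabulary; a real model would use the full SMILES alphabet.
VOCAB = ["C", "c", "N", "O", "(", ")", "1", "="]
MASK = "<mask>"

def denoise(tokens):
    # Stand-in for a learned denoiser: proposes a vocabulary token for each
    # masked position. A real denoiser would condition bidirectionally on the
    # unmasked context; here we sample uniformly for illustration.
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def generate(num_blocks=3, block_size=4, steps=2):
    """Autoregressive over blocks; iterative denoising within each block."""
    sequence = []
    for _ in range(num_blocks):
        block = [MASK] * block_size          # each block starts fully masked
        for step in range(steps):
            # Denoise the block given all previously generated blocks as context.
            proposal = denoise(sequence + block)[-block_size:]
            # Commit a growing prefix and re-mask the rest, so later steps
            # can refine positions under richer context.
            keep = block_size * (step + 1) // steps
            block = proposal[:keep] + [MASK] * (block_size - keep)
        sequence.extend(block)               # append the finished block
    return "".join(sequence)

print(generate())
```

The block structure is what distinguishes this from pure next-token prediction: within a block every position attends to every other, while blocks themselves are generated left to right, which is what enables the parallel-decoding speedups the paper reports.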
Related papers
- MolHIT: Advancing Molecular-Graph Generation with Hierarchical Discrete Diffusion Models [37.89307688620534]
We introduce MolHIT, a powerful molecular graph generation framework that overcomes long-standing performance limitations in existing methods. Overall, MolHIT achieves new state-of-the-art performance on the MOSES dataset with near-perfect validity for the first time in graph diffusion.
arXiv Detail & Related papers (2026-02-19T18:27:11Z)
- Improving Large Molecular Language Model via Relation-aware Multimodal Collaboration [34.099746438477816]
We propose CoLLaMo, a large language model-based molecular assistant equipped with a multi-level molecular modality-collaborative projector. Our experiments demonstrate that CoLLaMo enhances the molecular modality generalization capabilities of LMLMs.
arXiv Detail & Related papers (2026-01-18T04:38:19Z)
- KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge [73.51130155601824]
We introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels. We also propose a chemically informative molecular representation, effectively addressing limitations in existing molecular representation strategies. KnowMol achieves superior performance across molecular understanding and generation tasks.
arXiv Detail & Related papers (2025-10-22T11:23:58Z)
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view. Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching [6.401101865760261]
We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. We also propose a Natural Product Generation benchmark to evaluate modern molecular graph generative models' ability to generate natural product-like molecules.
arXiv Detail & Related papers (2025-02-19T07:01:00Z)
- FARM: Functional Group-Aware Representations for Small Molecules [55.281754551202326]
We introduce Functional Group-Aware Representations for Small Molecules (FARM). FARM is a novel model designed to bridge the gap between SMILES, natural language, and molecular graphs. We evaluate FARM on the MoleculeNet dataset, where it achieves state-of-the-art performance on 11 out of 13 tasks.
arXiv Detail & Related papers (2024-10-02T23:04:58Z)
- LDMol: A Text-to-Molecule Diffusion Model with Structurally Informative Latent Space Surpasses AR Models [55.5427001668863]
We present a novel latent diffusion model dubbed LDMol for text-conditioned molecule generation. Experiments show that LDMol outperforms the existing autoregressive baselines on the text-to-molecule generation benchmark. We show that LDMol can be applied to downstream tasks such as molecule-to-text retrieval and text-guided molecule editing.
arXiv Detail & Related papers (2024-05-28T04:59:13Z)
- Data-Efficient Molecular Generation with Hierarchical Textual Inversion [48.816943690420224]
We introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method.
HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution.
Compared to the conventional textual inversion method in the image domain using a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution.
arXiv Detail & Related papers (2024-05-05T08:35:23Z)
- MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures [2.5563339057415218]
MolIG is a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures.
It amalgamates the strengths of both molecular representation forms.
It exhibits enhanced performance in downstream tasks pertaining to molecular property prediction within benchmark groups.
arXiv Detail & Related papers (2023-11-28T10:28:35Z)
- Domain-Agnostic Molecular Generation with Chemical Feedback [44.063584808910896]
MolGen is a pre-trained molecular language model tailored specifically for molecule generation.
It internalizes structural and grammatical insights through the reconstruction of over 100 million molecular SELFIES.
Our chemical feedback paradigm steers the model away from molecular hallucinations, ensuring alignment between the model's estimated probabilities and real-world chemical preferences.
arXiv Detail & Related papers (2023-01-26T17:52:56Z)
- MIMOSA: Multi-constraint Molecule Sampling for Molecule Optimization [51.00815310242277]
Generative models and reinforcement learning approaches have made initial progress, but they still struggle to optimize multiple drug properties simultaneously.
We propose the MultI-constraint MOlecule SAmpling (MIMOSA) approach, a sampling framework that uses an input molecule as an initial guess and samples molecules from the target distribution.
arXiv Detail & Related papers (2020-10-05T20:18:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.