SiDGen: Structure-informed Diffusion for Generative modeling of Ligands for Proteins
- URL: http://arxiv.org/abs/2511.09529v1
- Date: Thu, 13 Nov 2025 02:00:48 GMT
- Authors: Samyak Sanghvi, Nishant Ranjan, Tarak Karmakar
- Abstract summary: We present SiDGen, a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Designing ligands that are both chemically valid and structurally compatible with protein binding pockets is a key bottleneck in computational drug discovery. Existing approaches either ignore structural context or rely on expensive, memory-intensive encoding that limits throughput and scalability. We present SiDGen (Structure-informed Diffusion Generator), a protein-conditioned diffusion framework that integrates masked SMILES generation with lightweight folding-derived features for pocket awareness. To balance expressivity with efficiency, SiDGen supports two conditioning pathways: a streamlined mode that pools coarse structural signals from protein embeddings and a full mode that injects localized pairwise biases for stronger coupling. A coarse-stride folding mechanism with nearest-neighbor upsampling alleviates the quadratic memory costs of pair tensors, enabling training on realistic sequence lengths. Learning stability is maintained through in-loop chemical validity checks and an invalidity penalty, while large-scale training efficiency is restored via selective compilation, dataloader tuning, and gradient accumulation. In automated benchmarks, SiDGen generates ligands with high validity, uniqueness, and novelty, while achieving competitive performance in docking-based evaluations and maintaining reasonable molecular properties. These results demonstrate that SiDGen can deliver scalable, pocket-aware molecular design, providing a practical route to conditional generation for high-throughput drug discovery.
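The coarse-stride idea in the abstract can be made concrete with a small sketch. The code below is illustrative only (the paper's actual pair features and stride are not specified): it forms a pairwise bias on every `stride`-th residue, so the pair tensor costs (L/s)² instead of L² memory, then nearest-neighbor upsamples it back to full resolution.

```python
import numpy as np

def coarse_pair_bias(residue_emb: np.ndarray, stride: int) -> np.ndarray:
    """Build an L x L pairwise bias from a coarse-strided pair tensor.

    residue_emb: (L, D) per-residue embeddings (hypothetical input).
    The pair tensor is formed only on every `stride`-th residue, giving
    a (Lc, Lc) tensor with Lc ~ L/stride, then nearest-neighbor
    upsampled back to (L, L).
    """
    L = residue_emb.shape[0]
    coarse = residue_emb[::stride]                 # (Lc, D) strided subset
    pair_c = coarse @ coarse.T                     # coarse pairwise similarity, (Lc, Lc)
    idx = np.arange(L) // stride                   # map each residue to its coarse bin
    idx = np.minimum(idx, pair_c.shape[0] - 1)     # guard the trailing partial block
    return pair_c[np.ix_(idx, idx)]                # nearest-neighbor upsample to (L, L)
```

For L = 1000 and stride = 8, the coarse tensor holds 125 × 125 entries instead of 1000 × 1000, a 64-fold reduction, at the cost of block-constant resolution in the bias.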
Related papers
- SaDiT: Efficient Protein Backbone Design via Latent Structural Tokenization and Diffusion Transformers [50.18388227899971]
We present SaDiT, a novel framework that accelerates protein backbone generation by integrating SaProt Tokenization with a Diffusion Transformer (DiT) architecture. Experiments demonstrate that SaDiT outperforms state-of-the-art models, including RFDiffusion and Proteina, in both computational speed and structural viability.
arXiv Detail & Related papers (2026-02-06T13:50:13Z) - Edge-aware GAT-based protein binding site prediction [3.3941174310007685]
We propose an Edge-aware Graph Attention Network (Edge-aware GAT) model for the fine-grained prediction of binding sites across biomolecules. Our method constructs atom-level graphs and integrates multidimensional structural features, including geometric descriptors. Our model achieves an ROC-AUC of 0.93 for protein-protein binding site prediction, outperforming several state-of-the-art methods.
arXiv Detail & Related papers (2026-01-05T14:09:57Z) - S$^2$Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening [72.89086338778098]
We propose a two-stage framework for protein-ligand contrastive representation learning. In the first stage, we perform protein sequence pretraining on ChEMBL using an ESM2-based backbone. In the second stage, we fine-tune on PDBBind by fusing sequence and structure information through a residue-level gating module. This auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement.
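A residue-level gating module of the kind mentioned above can be sketched generically (the paper's actual module internals are not given in the abstract): a learned sigmoid gate decides, per residue and per feature, how much sequence versus structure information to pass on.

```python
import numpy as np

def gated_fuse(seq_emb: np.ndarray, struct_emb: np.ndarray,
               W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Per-residue gated fusion of sequence and structure embeddings.
    Illustrative sketch; W and b are hypothetical learned parameters.
    Shapes: seq_emb, struct_emb (L, D); W (2D, D); b (D,)."""
    gate_in = np.concatenate([seq_emb, struct_emb], axis=-1)   # (L, 2D)
    g = 1.0 / (1.0 + np.exp(-(gate_in @ W + b)))               # sigmoid gate, (L, D)
    # Convex per-feature mixture: g -> 1 trusts sequence, g -> 0 trusts structure.
    return g * seq_emb + (1.0 - g) * struct_emb
```

With zero-initialized W and b the gate is 0.5 everywhere and the module reduces to averaging the two modalities, which makes it a safe starting point before fine-tuning.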
arXiv Detail & Related papers (2025-11-10T11:57:47Z) - A Novel Framework for Multi-Modal Protein Representation Learning [13.33566214386641]
We propose Diffused and Aligned Multi-modal Protein Embedding (DAMPE), a unified framework built around two core mechanisms. First, we propose Optimal Transport (OT)-based representation alignment, which establishes correspondence between the intrinsic embedding spaces of different modalities. Second, we develop a Conditional Graph Generation (CGG)-based information fusion method, in which a condition encoder fuses the aligned intrinsic embeddings to provide informative cues for graph reconstruction.
arXiv Detail & Related papers (2025-10-27T12:33:01Z) - ProteinAE: Protein Diffusion Autoencoders for Structure Encoding [64.77182442408254]
We introduce ProteinAE, a novel and streamlined protein diffusion autoencoder. ProteinAE directly maps protein backbone coordinates from E(3) into a continuous, compact latent space. We demonstrate that ProteinAE achieves state-of-the-art reconstruction quality, outperforming existing autoencoders.
arXiv Detail & Related papers (2025-10-12T14:30:32Z) - NIRVANA: Structured pruning reimagined for large language models compression [50.651730342011014]
We introduce NIRVANA, a novel pruning method designed to balance immediate zero-shot accuracy preservation with robust fine-tuning. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules. Experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints.
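Structured pruning with per-layer sparsity allocation can be sketched in a few lines. This is a generic magnitude-based stand-in, not NIRVANA's actual criterion (which the abstract does not specify): whole output rows are ranked by L2 norm against one global threshold, so layers with many weak rows automatically absorb more of the sparsity budget.

```python
import numpy as np

def structured_prune(weights: list, global_sparsity: float) -> list:
    """Remove whole output rows by L2 norm against a single global
    threshold. Illustrative sketch of adaptive allocation: per-layer
    sparsity falls out of where the weak rows live, rather than being
    fixed uniformly per layer."""
    norms = [np.linalg.norm(W, axis=1) for W in weights]   # row norms per layer
    all_norms = np.concatenate(norms)
    k = int(global_sparsity * all_norms.size)              # rows to remove in total
    thresh = np.partition(all_norms, k)[k] if k > 0 else -np.inf
    # Keep only rows at or above the global threshold in each layer.
    return [W[n >= thresh] for W, n in zip(weights, norms)]
```

In a real model the matching input columns of the next layer would be removed too; the sketch only shows how a single global threshold yields non-uniform per-layer sparsity.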
arXiv Detail & Related papers (2025-09-17T17:59:00Z) - ReDiSC: A Reparameterized Masked Diffusion Model for Scalable Node Classification with Structured Predictions [64.17845687013434]
We propose ReDiSC, a reparameterized masked diffusion model for scalable node classification with structured predictions. We show that ReDiSC achieves superior or highly competitive performance compared to state-of-the-art GNN, label propagation, and diffusion-based baselines. Notably, ReDiSC scales effectively to large-scale datasets on which previous structured diffusion methods fail due to computational constraints.
arXiv Detail & Related papers (2025-07-19T04:46:53Z) - Reimagining Target-Aware Molecular Generation through Retrieval-Enhanced Aligned Diffusion [22.204642926984526]
READ is introduced, the first framework to merge molecular Retrieval-Augmented Generation with an SE(3)-equivariant diffusion model. It achieves very competitive performance on CBGBench, surpassing state-of-the-art generative models and even native scaffolds.
arXiv Detail & Related papers (2025-06-17T13:09:11Z) - Energy-Based Coarse-Graining in Molecular Dynamics: A Flow-Based Framework without Data [0.0]
Coarse-grained (CG) models provide an effective route to reducing the complexity of molecular simulations. We introduce a fully data-free, generative framework for CG that directly targets the all-atom Boltzmann distribution. We show that the method captures all relevant modes of the Boltzmann distribution, reconstructs atomic configurations, and automatically learns physically meaningful CG representations.
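The data-free idea above — fitting a generator against an energy function rather than against samples — can be shown on a toy problem. This is purely illustrative (the paper's flow architecture and energies are not specified): the target is a Gaussian Boltzmann density p(x) ∝ exp(-x²/(2s²)), the generator is x = a·z with z ~ N(0,1), and minimizing the reverse KL E_q[log q(x) − log p(x)] requires evaluating only the energy, never drawing samples from p.

```python
import numpy as np

# Toy data-free fit of a one-parameter "flow" to a Boltzmann target.
# Per-sample reverse-KL objective (up to constants):
#   -log a + a^2 z^2 / (2 s^2),  so  d/da = -1/a + a z^2 / s^2.
rng = np.random.default_rng(0)
s = 2.0     # target std; exp(-x^2 / (2 s^2)) plays the role of the Boltzmann factor
a = 0.5     # generator scale, trained with no data from the target
for _ in range(2000):
    z = rng.normal(size=256)                    # base samples only
    grad = np.mean(a * z**2 / s**2 - 1.0 / a)   # Monte Carlo gradient of reverse KL
    a -= 0.05 * grad                            # plain gradient descent
# At the optimum a^2 = s^2: the generator matches the target distribution.
```

The fixed point a = s follows from setting E[a·z²/s² − 1/a] = 0 with E[z²] = 1; the same recipe scales to real flows, where log q comes from the change-of-variables formula and log p from the molecular energy.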
arXiv Detail & Related papers (2025-04-29T17:05:27Z) - Fast and Accurate Blind Flexible Docking [79.88520988144442]
Molecular docking, which predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. We propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios.
arXiv Detail & Related papers (2025-02-20T07:31:13Z) - The Latent Road to Atoms: Backmapping Coarse-grained Protein Structures with Latent Diffusion [19.85659309869674]
Latent Diffusion Backmapping (LDB) is a novel approach leveraging denoising diffusion within latent space to address these challenges.
We evaluate LDB's state-of-the-art performance on three distinct protein datasets.
Our results position LDB as a powerful and scalable approach for backmapping, effectively bridging the gap between CG simulations and atomic-level analyses in computational biology.
arXiv Detail & Related papers (2024-10-17T06:38:07Z) - Protein Design with Guided Discrete Diffusion [67.06148688398677]
A popular approach to protein design is to combine a generative model with a discriminative model for conditional sampling.
We propose diffusioN Optimized Sampling (NOS), a guidance method for discrete diffusion models.
NOS makes it possible to perform design directly in sequence space, circumventing significant limitations of structure-based methods.
arXiv Detail & Related papers (2023-05-31T16:31:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.