IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data
- URL: http://arxiv.org/abs/2508.10775v1
- Date: Thu, 14 Aug 2025 15:59:22 GMT
- Title: IBEX: Information-Bottleneck-EXplored Coarse-to-Fine Molecular Generation under Limited Data
- Authors: Dong Xu, Zhangfan Yang, Jenna Xinyi Yao, Shuangbao Song, Zexuan Zhu, Junkai Ji
- Abstract summary: Three-dimensional generative models increasingly drive structure-based drug discovery, yet they remain constrained by the scarcity of publicly available protein-ligand complexes. We present IBEX, an Information-Bottleneck-EXplored coarse-to-fine pipeline to tackle the chronic shortage of protein-ligand complex data in structure-based drug design.
- Score: 18.780698265706945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Three-dimensional generative models increasingly drive structure-based drug discovery, yet they remain constrained by the scarcity of publicly available protein-ligand complexes. Under such data scarcity, almost all existing pipelines struggle to learn transferable geometric priors and consequently overfit to training-set biases. We therefore present IBEX, an Information-Bottleneck-EXplored coarse-to-fine pipeline to tackle the chronic shortage of protein-ligand complex data in structure-based drug design. Specifically, we use PAC-Bayesian information-bottleneck theory to quantify the information density of each sample. This analysis reveals how different masking strategies affect generalization and indicates that, compared with conventional de novo generation, the constrained Scaffold Hopping task endows the model with greater effective capacity and improved transfer performance. IBEX retains the original TargetDiff architecture and hyperparameters for training to generate molecules compatible with the binding pocket; it then applies an L-BFGS optimization step that refines each conformation by optimizing five physics-based terms and adjusting six translational and rotational degrees of freedom in under one second. With only these modifications, IBEX raises the zero-shot docking success rate on the CrossDocked2020-based CBGBench from 53% to 64%, improves the mean Vina score from $-7.41\ \mathrm{kcal\,mol^{-1}}$ to $-8.07\ \mathrm{kcal\,mol^{-1}}$, and achieves the best median Vina energy in 57 of 100 pockets versus 3 for the original TargetDiff. IBEX also increases the QED by 25%, achieves state-of-the-art validity and diversity, and markedly reduces extrapolation error.
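The rigid-body refinement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the five physics-based terms, so a single Morse-style contact term stands in for them here, and the function names (`apply_rigid`, `toy_energy`, `refine_pose`) and the parameters `r0` and `a` are hypothetical. The sketch assumes NumPy and SciPy, and only shows how six parameters (three for translation, three for a rotation vector) can be optimized with L-BFGS while the ligand's internal geometry stays fixed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation


def apply_rigid(params, coords):
    """Rotate coords about their centroid by a rotation vector, then translate."""
    translation, rotvec = params[:3], params[3:]
    center = coords.mean(axis=0)
    rotated = Rotation.from_rotvec(rotvec).apply(coords - center) + center
    return rotated + translation


def toy_energy(params, ligand, pocket, r0=4.0, a=1.0):
    """Placeholder score (assumption): Morse-style term over ligand-pocket atom pairs.

    This stands in for the paper's five physics-based terms, which are not
    specified in the abstract.
    """
    moved = apply_rigid(params, ligand)
    dists = np.linalg.norm(moved[:, None, :] - pocket[None, :, :], axis=-1)
    x = dists - r0
    return float(np.sum(np.exp(-2.0 * a * x) - 2.0 * np.exp(-a * x)))


def refine_pose(ligand, pocket):
    """Optimize the 6 rigid-body degrees of freedom with SciPy's L-BFGS-B."""
    x0 = np.zeros(6)  # start from the generated pose (no translation or rotation)
    result = minimize(toy_energy, x0, args=(ligand, pocket), method="L-BFGS-B")
    return apply_rigid(result.x, ligand), result
```

Because only a translation and a rotation are optimized, all intramolecular distances of the ligand are preserved; the gradient here is estimated by finite differences, whereas a production version would more likely supply analytic gradients for speed.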
Related papers
- Hybrid Gated Flow (HGF): Stabilizing 1.58-bit LLMs via Selective Low-Rank Correction [0.766310831583367]
Hybrid Gated Flow (HGF) is a dual-stream architecture that couples a 1.58-bit ternary backbone with a learnable, low-rank FP16 correction path. We show that HGF 5.4 achieves a validation loss of 0.9306 compared to BitNet's 1.0294, recovering approximately 55% of the quality gap between pure ternary quantization and the FP16 baseline.
arXiv Detail & Related papers (2026-02-05T03:47:17Z) - Physics Enhanced Deep Surrogates for the Phonon Boltzmann Transport Equation [0.0]
Physics-Enhanced Deep Surrogate (PEDS): the network learns geometry-dependent corrections and a mixing coefficient that interpolates between macroscopic and nano-scale behavior. PEDS reduces training-data requirements by up to 70% compared with purely data-driven baselines.
arXiv Detail & Related papers (2025-11-25T16:25:24Z) - Pretrained Joint Predictions for Scalable Batch Bayesian Optimization of Molecular Designs [1.3505106522886807]
We show how to obtain scalable probabilistic surrogates of binding affinity for use in Batch Bayesian Optimization. This demands parallel acquisition functions that hedge between designs and the ability to rapidly sample from a joint predictive density to approximate them. Key to this work is an investigation into the importance of prior networks in ENNs and how to pretrain them on synthetic data to improve downstream performance.
arXiv Detail & Related papers (2025-11-13T18:26:58Z) - Pearl: A Foundation Model for Placing Every Atom in the Right Location [52.35027831422145]
We introduce Pearl, a foundation model for protein-ligand cofolding at scale. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks.
arXiv Detail & Related papers (2025-10-28T17:36:51Z) - Muon: Training and Trade-offs with Latent Attention and MoE [4.500362688166346]
We present a comprehensive theoretical and empirical study of the Muon optimizer for training small to medium decoder-only transformers (30M-200M parameters). We provide rigorous theoretical analysis including: (i) the convergence rate under standard assumptions, (ii) spectral regularization properties that prevent gradient explosion, (iii) connection to natural gradient descent on the Stiefel manifold, and (iv) equivalence to steepest gradient descent under the spectral norm.
arXiv Detail & Related papers (2025-09-29T07:51:06Z) - ResCap-DBP: A Lightweight Residual-Capsule Network for Accurate DNA-Binding Protein Prediction Using Global ProteinBERT Embeddings [9.626183317998143]
We propose a novel deep learning framework, ResCap-DBP, that combines a residual learning-based encoder with a one-dimensional Capsule Network. ProteinBERT embeddings substantially outperform other representations on large datasets. Our model consistently outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2025-07-27T21:54:32Z) - Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices. We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD). The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller).
arXiv Detail & Related papers (2025-07-14T16:19:00Z) - Is Architectural Complexity Overrated? Competitive and Interpretable Knowledge Graph Completion with RelatE [6.959701672059059]
RelatE is an interpretable and modular method that efficiently integrates dual representations for entities and relations. It achieves competitive or superior performance on standard benchmarks. Perturbation studies demonstrate improved robustness, with MRR reduced by up to 61% relative to TransE and by up to 19% compared to RotatE.
arXiv Detail & Related papers (2025-05-25T04:36:52Z) - Breaking the Compression Ceiling: Data-Free Pipeline for Ultra-Efficient Delta Compression [53.08742231761896]
UltraDelta is a data-free delta compression pipeline that achieves both ultra-high compression and strong performance. UltraDelta is designed to minimize redundancy, maximize information, and stabilize performance across inter-layer, intra-layer, and global dimensions.
arXiv Detail & Related papers (2025-05-19T10:37:22Z) - Advancing Tabular Stroke Modelling Through a Novel Hybrid Architecture and Feature-Selection Synergy [0.9999629695552196]
The present work develops and validates a data-driven and interpretable machine-learning framework designed to predict strokes. Ten routinely gathered demographic, lifestyle, and clinical variables were sourced from a public cohort of 4,981 records. The proposed model achieved an accuracy rate of 97.2% and an F1-score of 97.15%, indicating a significant enhancement compared to the leading individual model.
arXiv Detail & Related papers (2025-05-18T21:46:45Z) - BoKDiff: Best-of-K Diffusion Alignment for Target-Specific 3D Molecule Generation [0.0]
Structure-based drug design (SBDD) leverages the 3D structure of biomolecular targets to guide the creation of new therapeutic agents. Recent advances in generative models, including geometric models and deep learning, have demonstrated promise in optimizing ligand generation. We propose BoKDiff, a novel framework that enhances ligand generation by combining multi-objective optimization and Best-of-K alignment methodologies.
arXiv Detail & Related papers (2025-01-26T18:29:11Z) - Offline Behavior Distillation [57.6900189406964]
Massive reinforcement learning (RL) data are typically collected to train policies offline without the need for interactions.
We formulate offline behavior distillation (OBD), which synthesizes limited expert behavioral data from sub-optimal RL data.
We propose two naive OBD objectives, DBC and PBC, which measure distillation performance via the decision difference between policies trained on distilled data and either offline data or a near-expert policy.
arXiv Detail & Related papers (2024-10-30T06:28:09Z) - TacoGFN: Target-conditioned GFlowNet for Structure-based Drug Design [3.45184803671951]
TacoGFN is a novel GFlowNet-based approach for structure-based drug design.
It can generate molecules conditioned on any protein pocket structure with probabilities proportional to its affinity and property rewards.
In the generative setting for the CrossDocked 2020 benchmark, TacoGFN attains a state-of-the-art success rate of $56.0\%$ and $-8.44$ kcal/mol in median Vina Dock score.
arXiv Detail & Related papers (2023-10-05T00:45:04Z) - Generalizing electrocardiogram delineation: training convolutional neural networks with synthetic data augmentation [63.51064808536065]
Existing databases for ECG delineation are small, being insufficient in size and in the array of pathological conditions they represent.
This article has two main contributions. First, a pseudo-synthetic data generation algorithm was developed, based on probabilistically composing ECG traces given "pools" of fundamental segments, as cropped from the original databases, and a set of rules for their arrangement into coherent synthetic traces.
Second, two novel segmentation-based loss functions have been developed, which attempt to enforce the prediction of an exact number of independent structures and to produce closer segmentation boundaries by focusing on a reduced number of samples.
arXiv Detail & Related papers (2021-11-25T10:11:41Z) - Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilation convolution is a critical variant of the standard convolution neural network that controls effective receptive fields and handles large scale variance of objects.
We propose a new variant of dilated convolution, namely inception (dilated) convolution, where the convolutions have independent dilation among different axes, channels and layers.
To explore a practical method for fitting the complex inception convolution to the data, a simple yet effective dilation search algorithm (EDO) based on statistical optimization is developed.
arXiv Detail & Related papers (2020-12-25T14:58:35Z) - On the Difference Between the Information Bottleneck and the Deep Information Bottleneck [81.89141311906552]
We revisit the Deep Variational Information Bottleneck and the assumptions needed for its derivation.
We show how to circumvent this limitation by optimising a lower bound for $I(T;Y)$ for which only the latter Markov chain has to be satisfied.
arXiv Detail & Related papers (2019-12-31T18:31:42Z)