Fine-Tuning ChemBERTa for Predicting Inhibitory Activity Against TDP1 Using Deep Learning
- URL: http://arxiv.org/abs/2512.04252v1
- Date: Wed, 03 Dec 2025 20:42:22 GMT
- Title: Fine-Tuning ChemBERTa for Predicting Inhibitory Activity Against TDP1 Using Deep Learning
- Authors: Baichuan Zeng
- Abstract summary: Predicting the potency of small molecules against Tyrosyl-DNA Phosphodiesterase 1 (TDP1) is a critical challenge in early drug discovery. We present a deep learning framework for the quantitative regression of pIC50 values using fine-tuned variants of ChemBERTa. Our approach outperforms the Random Predictor baseline in both regression accuracy and virtual screening utility.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting the inhibitory potency of small molecules against Tyrosyl-DNA Phosphodiesterase 1 (TDP1), a key target in overcoming cancer chemoresistance, remains a critical challenge in early drug discovery. We present a deep learning framework for the quantitative regression of pIC50 values from Simplified Molecular Input Line Entry System (SMILES) strings using fine-tuned variants of ChemBERTa, a pre-trained chemical language model. Leveraging a large-scale consensus dataset of 177,092 compounds, we systematically evaluate two pre-training strategies, Masked Language Modeling (MLM) and Masked Token Regression (MTR), under stratified data splits and sample weighting to address severe activity imbalance, in which only 2.1% of compounds are active. Our approach outperforms the Random Predictor baseline in both regression accuracy and virtual screening utility, and performs competitively with Random Forest, achieving a high enrichment factor (EF@1% = 17.4) and precision (Precision@1% = 37.4) among top-ranked predictions. The resulting model, validated through rigorous ablation and hyperparameter studies, provides a robust, ready-to-deploy tool for prioritizing TDP1 inhibitors for experimental testing. By enabling accurate, 3D-structure-free pIC50 prediction directly from SMILES, this work demonstrates the transformative potential of chemical transformers in accelerating target-specific drug discovery.
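The screening metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: Precision@1% is the fraction of actives among the top 1% of compounds ranked by predicted pIC50, and EF@1% is that precision divided by the overall active fraction. The function name `top_fraction_metrics` and the toy data are hypothetical and are not taken from the paper, which may compute these metrics differently.

```python
def top_fraction_metrics(scores, is_active, fraction=0.01):
    """Return (precision, enrichment_factor) for the top `fraction` of compounds.

    scores    : predicted pIC50 values (higher means more potent)
    is_active : booleans marking experimentally active compounds
    Assumed definitions, not confirmed by the paper.
    """
    if not scores or len(scores) != len(is_active):
        raise ValueError("scores and is_active must be non-empty and of equal length")
    n = len(scores)
    # Number of compounds in the top fraction, at least one
    k = max(1, round(n * fraction))
    # Rank compound indices by predicted pIC50, highest first
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if is_active[i])
    precision = hits / k
    base_rate = sum(1 for a in is_active if a) / n
    # EF compares the hit rate in the top fraction with the overall hit rate
    enrichment = precision / base_rate if base_rate > 0 else float("nan")
    return precision, enrichment


if __name__ == "__main__":
    # Toy library: 1000 compounds, 20 actives, 5 of them ranked in the top 10
    toy_scores = [1000 - i for i in range(1000)]
    toy_active = [i < 5 or 500 <= i < 515 for i in range(1000)]
    print(top_fraction_metrics(toy_scores, toy_active))
```

Under these definitions, a random ranking gives an enrichment factor near 1, so values well above 1 indicate that actives are concentrated at the top of the ranked list.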
Related papers
- EnzyCLIP: A Cross-Attention Dual Encoder Framework with Contrastive Learning for Predicting Enzyme Kinetic Constants [2.92594095183629]
We present EnzyCLIP, a novel dual-encoder framework to predict enzyme kinetic parameters from protein sequences and substrate molecular structures. The model is trained on the CatPred-DB database containing 23,151 Kcat and 41,174 Km experimentally validated measurements. XGBoost ensemble methods applied to the learned embeddings further improved Km prediction (R2 = 0.61) while maintaining robust Kcat performance.
arXiv Detail & Related papers (2025-11-29T08:13:06Z) - Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties [4.53318808068234]
Adverse events (AEs) may signal unexpected or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using 1.28 million reports from the U.S. FDA's OpenFDA Center for Veterinary Medicine.
arXiv Detail & Related papers (2025-10-01T23:34:46Z) - Valid Property-Enhanced Contrastive Learning for Targeted Optimization & Resampling for Novel Drug Design [1.4874449172133888]
VECTOR+ is a framework that couples property-guided representation learning with controllable molecule generation. VECTOR+ generates novel, synthetically tractable candidates. VECTOR+ generalizes to kinase inhibitors, producing compounds with stronger docking scores than established drugs.
arXiv Detail & Related papers (2025-08-31T03:55:29Z) - SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction [16.189335444981353]
Predicting the absorption, distribution, metabolism, excretion, and toxicity of small-molecule drugs is critical for ensuring safety and efficacy.
We propose a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning strategies.
Our results demonstrate that SMILES-Mamba exhibits competitive performance across 22 ADMET datasets, achieving the highest score in 14 tasks.
arXiv Detail & Related papers (2024-08-11T04:53:12Z) - YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention [9.018408514318631]
Traditional methods often miss complex molecular structures, leading to inaccuracies.
We introduce the YZS-Model, a deep learning framework integrating Graph Convolutional Networks (GCN), Transformer architectures, and Long Short-Term Memory (LSTM) networks.
YZS-Model achieved an $R^2$ of 0.59 and an RMSE of 0.57, outperforming benchmark models.
arXiv Detail & Related papers (2024-06-27T12:40:29Z) - Regressor-free Molecule Generation to Support Drug Response Prediction [83.25894107956735]
Conditional generation based on the target IC50 score can obtain a more effective sampling space.
Regressor-free guidance combines a diffusion model's score estimation with a regression controller model's gradient based on number labels.
arXiv Detail & Related papers (2024-05-23T13:22:17Z) - Efficient Prediction of Peptide Self-assembly through Sequential and Graphical Encoding [57.89530563948755]
This work provides a benchmark analysis of peptide encoding with advanced deep learning models.
It serves as a guide for a wide range of peptide-related predictions such as isoelectric points, hydration free energy, etc.
arXiv Detail & Related papers (2023-07-17T00:43:33Z) - MetaRF: Differentiable Random Forest for Reaction Yield Prediction with a Few Trails [58.47364143304643]
In this paper, we focus on the reaction yield prediction problem.
We first put forth MetaRF, an attention-based differentiable random forest model specially designed for the few-shot yield prediction.
To improve the few-shot learning performance, we further introduce a dimension-reduction based sampling method.
arXiv Detail & Related papers (2022-08-22T06:40:13Z) - SPLDExtraTrees: Robust machine learning approach for predicting kinase inhibitor resistance [1.0674604700001966]
We propose a robust machine learning method, SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation.
The proposed method ranks training data following a specific scheme that starts with easy-to-learn samples.
Experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios.
arXiv Detail & Related papers (2021-11-15T09:07:45Z) - Improved Drug-target Interaction Prediction with Intermolecular Graph Transformer [98.8319016075089]
We propose a novel approach to model intermolecular information with a three-way Transformer-based architecture.
Intermolecular Graph Transformer (IGT) outperforms state-of-the-art approaches by 9.1% and 20.5% over the second best for binding activity and binding pose prediction respectively.
IGT exhibits promising drug screening ability against SARS-CoV-2 by identifying 83.1% active drugs that have been validated by wet-lab experiments with near-native predicted binding poses.
arXiv Detail & Related papers (2021-10-14T13:28:02Z) - Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z) - Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions [80.12620331438052]
Deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features.
Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets.
We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
arXiv Detail & Related papers (2020-06-25T08:46:37Z)