Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models
- URL: http://arxiv.org/abs/2512.11412v1
- Date: Fri, 12 Dec 2025 09:41:04 GMT
- Title: Task-Specific Sparse Feature Masks for Molecular Toxicity Prediction with Chemical Language Models
- Authors: Kwun Sy Lee, Jiawei Chen, Fuk Sheng Ford Chung, Tianyu Zhao, Zhenyuan Chen, Debby D. Wang
- Abstract summary: We propose a novel multi-task learning (MTL) framework to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint.
- Score: 5.563119267291969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable in silico molecular toxicity prediction is a cornerstone of modern drug discovery, offering a scalable alternative to experimental screening. However, the black-box nature of state-of-the-art models remains a significant barrier to adoption, as high-stakes safety decisions demand verifiable structural insights alongside predictive performance. To address this, we propose a novel multi-task learning (MTL) framework designed to jointly enhance accuracy and interpretability. Our architecture integrates a shared chemical language model with task-specific attention modules. By imposing an L1 sparsity penalty on these modules, the framework is constrained to focus on a minimal set of salient molecular fragments for each distinct toxicity endpoint. The resulting framework is trained end-to-end and is readily adaptable to various transformer-based backbones. Evaluated on the ClinTox, SIDER, and Tox21 benchmark datasets, our approach consistently outperforms both single-task and standard MTL baselines. Crucially, the sparse attention weights provide chemically intuitive visualizations that reveal the specific fragments influencing predictions, thereby enhancing insight into the model's decision-making process.
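The abstract describes task-specific attention modules over a shared chemical language model, regularized with an L1 sparsity penalty. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the shared encoder has already produced one embedding per SMILES token, that each task gates tokens with a sigmoid, and that the loss is binary cross-entropy plus a weighted L1 term. The names `task_forward`, `mtl_loss`, `gate_ws`, `head_ws`, and `l1_weight` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def task_forward(tokens, gate_w, head_w):
    """Gate each token for one task, pool the gated tokens, and predict.

    tokens: list of per-token embeddings (assumed output of a shared encoder).
    gate_w: hypothetical task-specific gating vector.
    head_w: hypothetical task-specific classifier vector.
    """
    # One gate per token, each in (0, 1); small gates mean "ignore this token".
    gates = [sigmoid(dot(gate_w, h)) for h in tokens]
    total = sum(gates) + 1e-8
    dim = len(tokens[0])
    # Gate-weighted average of the token embeddings.
    pooled = [sum(g * h[i] for g, h in zip(gates, tokens)) / total
              for i in range(dim)]
    return sigmoid(dot(head_w, pooled)), gates


def mtl_loss(tokens, labels, gate_ws, head_ws, l1_weight=0.01):
    """Mean binary cross-entropy over tasks plus an L1 penalty on the gates."""
    bce, l1 = 0.0, 0.0
    for y, gw, hw in zip(labels, gate_ws, head_ws):
        p, gates = task_forward(tokens, gw, hw)
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        bce += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        # Gates are positive, so the L1 norm is their sum; average per token.
        l1 += sum(abs(g) for g in gates) / len(gates)
    n_tasks = len(labels)
    return bce / n_tasks + l1_weight * l1 / n_tasks


# Toy data: 5 tokens with 4-dimensional embeddings and 2 toxicity endpoints.
rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
gate_ws = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
head_ws = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]
labels = [1, 0]

loss_plain = mtl_loss(tokens, labels, gate_ws, head_ws, l1_weight=0.0)
loss_sparse = mtl_loss(tokens, labels, gate_ws, head_ws, l1_weight=0.1)
```

In a real training loop the gating and classifier weights would be learned by gradient descent together with the encoder. The L1 term then pushes most gates toward zero, so the few tokens that keep large gates can be read as the fragments driving each endpoint's prediction.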
Related papers
- Agentic reinforcement learning empowers next-generation chemical language models for molecular design and synthesis [51.83339196548892]
ChemCraft is a novel framework that decouples chemical reasoning from knowledge storage.
ChemCraft achieves superior performance with minimal inference costs.
This work establishes a cost-effective and privacy-preserving paradigm for AI-aided chemistry.
arXiv Detail & Related papers (2026-01-25T04:23:34Z)
- How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear.
We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures.
Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z)
- Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets.
We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
- Reasoning-Enhanced Large Language Models for Molecular Property Prediction [19.593493317167646]
Molecular property prediction is crucial for drug discovery and materials science.
Existing approaches suffer from limited interpretability, poor cross-task generalization, and lack of chemical reasoning capabilities.
We propose MPPReasoner, a multimodal large language model that incorporates chemical reasoning for molecular property prediction.
arXiv Detail & Related papers (2025-10-11T15:05:45Z)
- $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view.
Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z)
- ChemActor: Enhancing Automated Extraction of Chemical Synthesis Actions with LLM-Generated Data [53.78763789036172]
We present ChemActor, a fully fine-tuned large language model (LLM) as a chemical executor to convert between unstructured experimental procedures and structured action sequences.
This framework integrates a data selection module that selects data based on distribution divergence, with a general-purpose LLM, to generate machine-executable actions from a single molecule input.
Experiments on reaction-to-description (R2D) and description-to-action (D2A) tasks demonstrate that ChemActor achieves state-of-the-art performance, outperforming the baseline model by 10%.
arXiv Detail & Related papers (2025-06-30T05:11:19Z)
- Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification? [19.700175505235876]
ToxiMol is the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair.
We construct a standardized dataset covering 11 primary tasks and 560 representative toxic molecules spanning diverse mechanisms and granularities.
arXiv Detail & Related papers (2025-06-12T17:25:53Z)
- Tokenization for Molecular Foundation Models [0.0]
We systematically evaluate 34 tokenizers, including 19 chemistry-specific ones, and reveal significant gaps in their coverage of the SMILES molecular representation.
We propose two new tokenizers -- Smirk and Smirk-GPE -- with full coverage of the OpenSMILES specification.
arXiv Detail & Related papers (2024-09-19T02:36:04Z)
- Holistic chemical evaluation reveals pitfalls in reaction prediction models [0.3065062372337749]
We propose a new assessment scheme that builds on current approaches, steering towards a more holistic evaluation.
ChoRISO is a curated dataset along with multiple tailored splits to recreate chemically relevant scenarios.
Our work paves the way towards robust prediction models that can ultimately accelerate chemical discovery.
arXiv Detail & Related papers (2023-12-14T14:54:28Z)
- Learning Invariant Molecular Representation in Latent Discrete Space [52.13724532622099]
We propose a new framework for learning molecular representations that exhibit invariance and robustness against distribution shifts.
Our model achieves stronger generalization against state-of-the-art baselines in the presence of various distribution shifts.
arXiv Detail & Related papers (2023-10-22T04:06:44Z)
- Meaningful machine learning models and machine-learned pharmacophores from fragment screening campaigns [0.0]
We derive machine learning models from over 50 fragment-screening campaigns.
We provide a physically interpretable and verifiable representation of what the ML model considers important for successful binding.
We find good agreement between the key molecular substructures proposed by the ML model and those assigned manually.
arXiv Detail & Related papers (2022-03-25T18:08:55Z)
- Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions [80.12620331438052]
Deep learning has become an important tool for rapid screening of billions of molecules in silico for potential hits containing desired chemical features.
Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets.
We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
arXiv Detail & Related papers (2020-06-25T08:46:37Z)
- Predicting drug properties with parameter-free machine learning: Pareto-Optimal Embedded Modeling (POEM) [0.13854111346209866]
We describe a similarity-based method for predicting molecular properties. POEM is a non-parametric, supervised ML algorithm developed to generate reliable predictive models without need for optimization.
We benchmark POEM relative to industry-standard ML algorithms and published results across 17 classification tasks. POEM performs well in all cases and reduces the risk of overfitting.
arXiv Detail & Related papers (2020-02-11T17:20:28Z)