A million-scale dataset and generalizable foundation model for nanomaterial-protein interactions
- URL: http://arxiv.org/abs/2507.14245v1
- Date: Fri, 18 Jul 2025 00:00:52 GMT
- Title: A million-scale dataset and generalizable foundation model for nanomaterial-protein interactions
- Authors: Hengjie Yu, Kenneth A. Dawson, Haiyun Yang, Shuya Liu, Yan Yan, Yaochu Jin,
- Abstract summary: We propose NanoPro-3M, the largest nanomaterial-protein interaction dataset to date, comprising over 3.2 million samples and 37,000 unique proteins. We present NanoProFormer, a foundational model that predicts nanomaterial-protein affinities through multimodal representation learning.
- Score: 22.339823160991934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unlocking the potential of nanomaterials in medicine and environmental science hinges on understanding their interactions with proteins, a complex decision space where AI is poised to make a transformative impact. However, progress has been hindered by limited datasets and the restricted generalizability of existing models. Here, we propose NanoPro-3M, the largest nanomaterial-protein interaction dataset to date, comprising over 3.2 million samples and 37,000 unique proteins. Leveraging this, we present NanoProFormer, a foundational model that predicts nanomaterial-protein affinities through multimodal representation learning, demonstrating strong generalization and handling missing features as well as unseen nanomaterials or proteins. We show that multimodal modeling significantly outperforms single-modality approaches and identifies key determinants of corona formation. Furthermore, we demonstrate its applicability to a range of downstream tasks through zero-shot inference and fine-tuning. Together, this work establishes a solid foundation for high-performance and generalized prediction of nanomaterial-protein interaction endpoints, reducing experimental reliance and accelerating various in vitro applications.
Related papers
- MOFSimBench: Evaluating Universal Machine Learning Interatomic Potentials In Metal--Organic Framework Molecular Modeling [0.19506923346234722]
Universal machine learning interatomic potentials (uMLIPs) have emerged as powerful tools for accelerating atomistic simulations. We introduce MOFSimBench, a benchmark to evaluate uMLIPs on key materials modeling tasks for nanoporous materials. We find that top-performing uMLIPs consistently outperform classical force fields and fine-tuned machine learning potentials across all tasks.
arXiv Detail & Related papers (2025-07-16T00:00:55Z)
- NbBench: Benchmarking Language Models for Comprehensive Nanobody Tasks [6.485214172837228]
We introduce NbBench, the first comprehensive benchmark suite for nanobody representation learning. NbBench encompasses structure annotation, binding prediction, and developability assessment. Our analysis reveals that antibody language models excel in antigen-related tasks, while performance on regression tasks such as thermostability and affinity remains challenging.
arXiv Detail & Related papers (2025-05-04T08:18:10Z)
- An All-Atom Generative Model for Designing Protein Complexes [49.09672038729524]
APM (All-Atom Protein Generative Model) is a model specifically designed for modeling multi-chain proteins. It is capable of precisely modeling inter-chain interactions and designing protein complexes with binding capabilities from scratch. It also performs folding and inverse-folding tasks for multi-chain proteins.
arXiv Detail & Related papers (2025-04-17T16:37:41Z)
- UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion [61.690978792873196]
Existing approaches rely on either autoregressive sequence models or diffusion models. We propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. We validate the effectiveness of UniGenX on material and small molecule generation tasks.
arXiv Detail & Related papers (2025-03-09T16:43:07Z)
- SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z)
- ProteinBench: A Holistic Evaluation of Protein Foundation Models [53.59325047872512]
We introduce ProteinBench, a holistic evaluation framework for protein foundation models.
Our approach consists of three key components: (i) A taxonomic classification of tasks that broadly encompass the main challenges in the protein domain, based on the relationships between different protein modalities; (ii) A multi-metric evaluation approach that assesses performance across four key dimensions: quality, novelty, diversity, and robustness; and (iii) In-depth analyses from various user objectives, providing a holistic view of model performance.
arXiv Detail & Related papers (2024-09-10T06:52:33Z)
- Unveiling the Potential of AI for Nanomaterial Morphology Prediction [0.0]
This study explores the potential of AI to predict the morphology of nanoparticles within the data availability constraints.
We first generated a new multi-modal dataset that is double the size of analogous studies.
arXiv Detail & Related papers (2024-05-31T19:16:07Z)
- Protein binding affinity prediction under multiple substitutions applying eGNNs on Residue and Atomic graphs combined with Language model information: eGRAL [1.840390797252648]
Deep learning is increasingly recognized as a powerful tool capable of bridging the gap between in-silico predictions and in-vitro observations.
We propose eGRAL, a novel graph neural network architecture designed for predicting binding affinity changes from amino acid substitutions in protein complexes.
eGRAL leverages residue, atomic and evolutionary scales, thanks to features extracted from protein large language models.
arXiv Detail & Related papers (2024-05-03T10:33:19Z)
- Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework [89.8609061423685]
We propose an information-theoretic approach to quantify the degree of redundancy, uniqueness, and synergy relating input modalities with an output task.
To validate partial information decomposition (PID) estimation, we conduct extensive experiments both on synthetic datasets where the PID is known and on large-scale multimodal benchmarks.
We demonstrate their usefulness in (1) quantifying interactions within multimodal datasets, (2) quantifying interactions captured by multimodal models, (3) principled approaches for model selection, and (4) three real-world case studies.
arXiv Detail & Related papers (2023-02-23T18:59:05Z)
- State-specific protein-ligand complex structure prediction with a multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z)
- Functional Nanomaterials Design in the Workflow of Building Machine-Learning Models [0.0]
Machine-learning (ML) techniques have revolutionized a host of research fields of chemical and materials science.
ML provides a more comprehensive insight into combinations with molecules/materials.
The key to the advances in nanomaterials discovery is how input fingerprints and output values can be linked quantitatively.
arXiv Detail & Related papers (2021-08-16T05:51:03Z)
- Machine Learning in Nano-Scale Biomedical Engineering [77.75587007080894]
We review the existing research regarding the use of machine learning in nano-scale biomedical engineering.
The main challenges that can be formulated as ML problems are classified into three main categories.
For each of the presented methodologies, special emphasis is given to its principles, applications, and limitations.
arXiv Detail & Related papers (2020-08-05T15:45:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.