Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes
- URL: http://arxiv.org/abs/2410.01755v3
- Date: Thu, 30 Oct 2025 17:28:23 GMT
- Title: Integrating Protein Sequence and Expression Level to Analysis Molecular Characterization of Breast Cancer Subtypes
- Authors: Hossein Sholehrasa, Majid Jaberi-Douraki,
- Abstract summary: This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes.<n>Using ProtGPT2, a language model specifically designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins.<n>These embeddings were integrated with protein expression levels to form enriched biological representations, which were analyzed using machine learning methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Breast cancer's complexity and variability pose significant challenges in understanding its progression and guiding effective treatment. This study aims to integrate protein sequence data with expression levels to improve the molecular characterization of breast cancer subtypes and predict clinical outcomes. Using ProtGPT2, a language model specifically designed for protein sequences, we generated embeddings that capture the functional and structural properties of proteins. These embeddings were integrated with protein expression levels to form enriched biological representations, which were analyzed using machine learning methods, such as ensemble K-means for clustering and XGBoost for classification. Our approach enabled the successful clustering of patients into biologically distinct groups and accurately predicted clinical outcomes such as survival and biomarker status, achieving high performance metrics, notably an F1 score of 0.88 for survival and 0.87 for biomarker status prediction. Feature importance analysis identified KMT2C, CLASP2, and MYO1B as key proteins involved in hormone signaling, cytoskeletal remodeling, and therapy resistance in hormone receptor-positive and triple-negative breast cancer, with potential influence on breast cancer subtype behavior and progression. Furthermore, protein-protein interaction networks and correlation analyses revealed functional interdependencies among proteins that may influence the behavior and progression of breast cancer subtypes. These findings suggest that integrating protein sequence and expression data provides valuable insights into tumor biology and has significant potential to enhance personalized treatment strategies in breast cancer care.
Related papers
- Modeling Dabrafenib Response Using Multi-Omics Modality Fusion and Protein Network Embeddings Based on Graph Convolutional Networks [0.0]
Cancer cell response to targeted therapy arises from complex molecular interactions, making single omics insufficient for accurate prediction.<n>This study develops a model to predict Dabrafenib sensitivity by integrating multiple omics layers (genomics, transcriptomics, epigenomics, metabolomics) with protein network embeddings generated using Graph Convolutional Networks (GCN)<n>Results show that attention guided multi omics fusion combined with GCN improves drug response prediction and reveals complementary molecular determinants of Dabrafenib sensitivity.
arXiv Detail & Related papers (2025-12-13T02:00:56Z) - KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction [60.23701115249195]
KEPLA is a novel deep learning framework that integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance.<n> Experiments on two benchmark datasets demonstrate that KEPLA consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-16T08:02:42Z) - Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence [54.14779179869007]
We highlight key areas where AI is driving innovation, from data analysis to new biological insights.
These include developing an AI-friendly ecosystem for data generation, sharing, and analysis.
arXiv Detail & Related papers (2025-02-21T13:20:33Z) - Multivariate Feature Selection and Autoencoder Embeddings of Ovarian Cancer Clinical and Genetic Data [2.973561339858947]
This study explores a data-driven approach to discovering novel clinical and genetic markers in ovarian cancer (OC)
In the autoencoder analysis, a clearer pattern emerged when using clinical features and the combination of clinical and genetic data.
Key clinical variables (such as type of surgery and neoadjuvant chemotherapy) and certain gene mutations showed strong relevance, along with low-risk genetic factors.
arXiv Detail & Related papers (2025-01-27T09:07:07Z) - SFM-Protein: Integrative Co-evolutionary Pre-training for Advanced Protein Sequence Representation [97.99658944212675]
We introduce a novel pre-training strategy for protein foundation models.
It emphasizes the interactions among amino acid residues to enhance the extraction of both short-range and long-range co-evolutionary features.
Trained on a large-scale protein sequence dataset, our model demonstrates superior generalization ability.
arXiv Detail & Related papers (2024-10-31T15:22:03Z) - Long-context Protein Language Model [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design.
Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths.
We propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built off selective structured state-space models.
We also introduce its graph-contextual variant, LC-PLM-G, which contextualizes protein-protein interaction graphs for a second stage of training.
arXiv Detail & Related papers (2024-10-29T16:43:28Z) - Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches [48.66541987908136]
Much work has been devoted to predicting binding affinity over the past decades.<n>We note growing use of both traditional machine learning and deep learning models for predicting binding affinity.<n>With improved predictive performance and the FDA's phasing out of animal testing, AI-driven in silico models, such as AI virtual cells (AIVCs), are poised to advance binding affinity prediction.
arXiv Detail & Related papers (2024-09-30T03:40:49Z) - Beyond ESM2: Graph-Enhanced Protein Sequence Modeling with Efficient Clustering [24.415612744612773]
Proteins are essential to life's processes, underpinning evolution and diversity.
Advances in sequencing technology have revealed millions of proteins, underscoring the need for sophisticated pre-trained protein models for biological analysis and AI development.
Facebook's ESM2, the most advanced protein language model to date, leverages a masked prediction task for unsupervised learning, crafting amino acid representations with notable biochemical accuracy.
Yet, it lacks in delivering functional protein insights, signaling an opportunity for enhancing representation quality.
This study addresses this gap by incorporating protein family classification into ESM2's training, while a contextual prediction task fine-tunes local
arXiv Detail & Related papers (2024-04-24T11:09:43Z) - Automated HER2 Scoring in Breast Cancer Images Using Deep Learning and Pyramid Sampling [3.711848341917877]
We introduce a deep learning-based approach utilizing pyramid sampling for the automated classification of HER2 status in IHC-stained BC tissue images.
Our approach analyzes morphological features at various spatial scales, efficiently managing the computational load and facilitating a detailed examination of cellular and larger-scale tissue-level details.
arXiv Detail & Related papers (2024-04-01T00:23:22Z) - Clustering for Protein Representation Learning [72.72957540484664]
We propose a neural clustering framework that can automatically discover the critical components of a protein.
Our framework treats a protein as a graph, where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids.
We evaluate on four protein-related tasks: protein fold classification, enzyme reaction classification, gene term prediction, and enzyme commission number prediction.
arXiv Detail & Related papers (2024-03-30T05:51:09Z) - One-dimensional convolutional neural network model for breast cancer
subtypes classification and biochemical content evaluation using micro-FTIR
hyperspectral images [0.0]
This study created a 1D deep learning tool for breast cancer subtype evaluation and biochemical contribution.
CaReNet-V1, a novel 1D convolutional neural network, was developed to classify breast cancer (CA) and adjacent tissue (AT)
A 1D adaptation of Grad-CAM was applied to assess the biochemical impact to the classifications.
arXiv Detail & Related papers (2023-10-23T16:58:34Z) - hist2RNA: An efficient deep learning architecture to predict gene
expression from breast cancer histopathology images [11.822321981275232]
Deep learning algorithms can effectively extract morphological patterns in digital histopathology images to predict molecular phenotypes quickly and cost-effectively.
We propose a new, computationally efficient approach called hist2RNA inspired by bulk RNA-sequencing techniques to predict the expression of 138 genes.
arXiv Detail & Related papers (2023-04-10T10:54:32Z) - Structure-informed Language Models Are Protein Designers [69.70134899296912]
We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs)
We conduct a structural surgery on pLMs, where a lightweight structural adapter is implanted into pLMs and endows it with structural awareness.
Experiments show that our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2023-02-03T10:49:52Z) - Functional Integrative Bayesian Analysis of High-dimensional
Multiplatform Genomic Data [0.8029049649310213]
We propose a framework called Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data (fiBAG)
fiBAG allows simultaneous identification of upstream functional evidence of proteogenomic biomarkers.
We demonstrate the profitability of fiBAG via a pan-cancer analysis of 14 cancer types.
arXiv Detail & Related papers (2022-12-29T03:31:45Z) - State-specific protein-ligand complex structure prediction with a
multi-scale deep generative model [68.28309982199902]
We present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures.
Our study suggests that a data-driven approach can capture the structural cooperativity between proteins and small molecules, showing promise in accelerating the design of enzymes, drug molecules, and beyond.
arXiv Detail & Related papers (2022-09-30T01:46:38Z) - Learning Geometrically Disentangled Representations of Protein Folding
Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein.
Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules.
Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.