TwinPurify: Purifying gene expression data to reveal tumor-intrinsic transcriptional programs via self-supervised learning
- URL: http://arxiv.org/abs/2601.18640v2
- Date: Tue, 27 Jan 2026 15:04:22 GMT
- Title: TwinPurify: Purifying gene expression data to reveal tumor-intrinsic transcriptional programs via self-supervised learning
- Authors: Zhiwei Zheng, Kevin Bryson
- Abstract summary: We introduce TwinPurify, a representation learning framework that adapts the Barlow Twins self-supervised objective. Rather than resolving the bulk mixture into discrete cell-type fractions, TwinPurify instead learns continuous, high-dimensional tumor embeddings.
- Score: 4.742294289533828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in single-cell and spatial transcriptomic technologies have transformed tumor ecosystem profiling at cellular resolution. However, large-scale studies on patient cohorts continue to rely on bulk transcriptomic data, where variation in tumor purity obscures tumor-intrinsic transcriptional signals and constrains downstream discovery. Many deconvolution methods report strong performance on synthetic bulk mixtures but fail to generalize to real patient cohorts because of unmodeled biological and technical variation. Here, we introduce TwinPurify, a representation learning framework that adapts the Barlow Twins self-supervised objective, representing a fundamental departure from the deconvolution paradigm. Rather than resolving the bulk mixture into discrete cell-type fractions, TwinPurify instead learns continuous, high-dimensional tumor embeddings by leveraging adjacent-normal profiles within the same cohort as "background" guidance, enabling the disentanglement of tumor-specific signals without relying on any external reference. Benchmarked against multiple large cancer cohorts across RNA-seq and microarray platforms, TwinPurify outperforms conventional representation learning baselines like auto-encoders in recovering tumor-intrinsic and immune signals. The purified embeddings improve molecular subtype and grade classification, enhance survival model concordance, and uncover biologically meaningful pathway activities compared to raw bulk profiles. By providing a transferable framework for decontaminating bulk transcriptomics, TwinPurify extends the utility of existing clinical datasets for molecular discovery.
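For orientation, the Barlow Twins objective that the abstract refers to can be sketched as follows. This is a minimal NumPy sketch of the standard, published Barlow Twins loss only; how TwinPurify adapts it (for example, how the two views are built from tumor and adjacent-normal profiles) is not specified in the abstract and is not reproduced here. The function name `barlow_twins_loss`, the weight `lam`, and the stabilizer `eps` are illustrative assumptions.

```python
import numpy as np

def barlow_twins_loss(z_a, z_b, lam=5e-3, eps=1e-8):
    """Standard Barlow Twins loss for two views' embeddings, each of shape (N, D).

    lam and eps are illustrative defaults, not values taken from TwinPurify.
    """
    n = z_a.shape[0]
    # Standardize every embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    # Cross-correlation matrix between the two views, shape (D, D).
    c = z_a.T @ z_b / n
    diag = np.diag(c)
    # Invariance term: pull the diagonal toward 1.
    on_diag = np.sum((1.0 - diag) ** 2)
    # Redundancy-reduction term: push off-diagonal entries toward 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)
    return on_diag + lam * off_diag

# Usage with random stand-in embeddings (hypothetical data, not gene expression).
rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
loss_identical_views = barlow_twins_loss(z, z)
loss_unrelated_views = barlow_twins_loss(z, rng.normal(size=(256, 8)))
```

Two identical views give a loss near zero, while unrelated views are penalized mainly through the diagonal term, which is the behavior the objective is designed to reward.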
Related papers
- Learning Glioblastoma Tumor Heterogeneity Using Brain Inspired Topological Neural Networks [3.120728330365825]
TopoGBM is a learning framework designed to capture scanner-robust representations from 3D MRI. Mechanistic interpretability analysis reveals that reconstruction residuals are highly localized to pathologically heterogeneous zones.
arXiv Detail & Related papers (2026-02-11T16:28:13Z)
- PEaRL: Pathway-Enhanced Representation Learning for Gene and Pathway Expression Prediction from Histology [8.879502752288325]
We present PEaRL (Pathway Enhanced Representation Learning), a framework that represents transcriptomics through pathway activation scores computed with ssGSEA. Across three cancer ST datasets, PEaRL consistently outperforms SOTA methods, yielding higher accuracy for both gene- and pathway-level expression prediction.
arXiv Detail & Related papers (2025-10-03T19:21:23Z)
- MS-ConTab: Multi-Scale Contrastive Learning of Mutation Signatures for Pan Cancer Representation and Stratification [0.0]
We introduce a novel unsupervised contrastive learning framework to cluster 43 cancer types. For each cancer type, we construct two complementary mutation signatures. We demonstrate that the resulting latent representations yield biologically meaningful clusters of cancer types.
arXiv Detail & Related papers (2025-08-26T20:42:20Z)
- Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way. We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z)
- TransST: Transfer Learning Embedded Spatial Factor Modeling of Spatial Transcriptomics Data [13.71468013489106]
We propose a novel transfer learning framework, referred to as TransST, to adaptively leverage the cell-labeled information from external sources. We show that TransST is both effective and robust in identifying cell subclusters and detecting corresponding driving biomarkers in spatial transcriptomics data.
arXiv Detail & Related papers (2025-04-15T22:03:38Z)
- MIRROR: Multi-Modal Pathological Self-Supervised Representation Learning via Modality Alignment and Retention [57.044719143401664]
Histopathology and transcriptomics are fundamental modalities in oncology, encapsulating the morphological and molecular aspects of the disease. We present MIRROR, a novel multi-modal representation learning method designed to foster both modality alignment and retention. Extensive evaluations on TCGA cohorts for cancer subtyping and survival analysis highlight MIRROR's superior performance.
arXiv Detail & Related papers (2025-03-01T07:02:30Z)
- Block Graph Neural Networks for tumor heterogeneity prediction [0.3611754783778107]
Accurate tumor classification is essential for selecting effective treatments. Standard tumor grading, which categorizes tumors based on cell differentiation, is not recommended as a stand-alone procedure. We propose to build on a mathematical model that simulates tumor evolution and generate artificial datasets for tumor classification.
arXiv Detail & Related papers (2025-02-08T05:48:09Z)
- Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
- Tertiary Lymphoid Structures Generation through Graph-based Diffusion [54.37503714313661]
In this work, we leverage state-of-the-art graph-based diffusion models to generate biologically meaningful cell-graphs.
We show that the adopted graph diffusion model is able to accurately learn the distribution of cells in terms of their tertiary lymphoid structures (TLS) content.
arXiv Detail & Related papers (2023-10-10T14:37:17Z)
- Prediction of brain tumor recurrence location based on multi-modal fusion and nonlinear correlation learning [55.789874096142285]
We present a deep learning-based brain tumor recurrence location prediction network.
We first train a multi-modal brain tumor segmentation network on the public dataset BraTS 2021.
Then, the pre-trained encoder is transferred to our private dataset for extracting the rich semantic features.
Two decoders are constructed to jointly segment the present brain tumor and predict its future tumor recurrence location.
arXiv Detail & Related papers (2023-04-11T02:45:38Z)
- CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data [61.088705993848606]
We introduce CausalBench, a benchmark suite for evaluating causal inference methods on real-world interventional data.
CausalBench incorporates biologically-motivated performance metrics, including new distribution-based interventional metrics.
arXiv Detail & Related papers (2022-10-31T13:04:07Z)
- Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs [6.708052194104378]
We extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model, to scale to massive single-cell datasets.
The key idea is to use an augmented kernel which preserves the factorisability of the lower bound allowing for fast variational inference.
arXiv Detail & Related papers (2022-09-14T15:25:15Z)
- Cancer Gene Profiling through Unsupervised Discovery [49.28556294619424]
We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers.
Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm.
Our signature reports promising results on distinguishing immune inflammatory and immune desert tumors.
arXiv Detail & Related papers (2021-02-11T09:04:45Z)