scE$^2$TM: Toward Interpretable Single-Cell Embedding via Topic Modeling
- URL: http://arxiv.org/abs/2507.08355v1
- Date: Fri, 11 Jul 2025 07:15:13 GMT
- Title: scE$^2$TM: Toward Interpretable Single-Cell Embedding via Topic Modeling
- Authors: Hegang Chen, Yuyin Lu, Zhiming Dai, Fu Lee Wang, Qing Li, Yanghui Rao,
- Abstract summary: We present scE2TM, an external knowledge-guided single-cell embedded topic model that provides a high-quality cell embedding and strong interpretation.<n>Our comprehensive evaluation across 20 scRNA-seq datasets demonstrates that scE2TM achieves significant clustering performance gains.
- Score: 21.79077173300944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in sequencing technologies have enabled researchers to explore cellular heterogeneity at single-cell resolution. Meanwhile, interpretability has gained prominence parallel to the rapid increase in the complexity and performance of deep learning models. In recent years, topic models have been widely used for interpretable single-cell embedding learning and clustering analysis, which we refer to as single-cell embedded topic models. However, previous studies evaluated the interpretability of the models mainly through qualitative analysis, and these single-cell embedded topic models suffer from the potential problem of interpretation collapse. Furthermore, their neglect of external biological knowledge constrains analytical performance. Here, we present scE2TM, an external knowledge-guided single-cell embedded topic model that provides a high-quality cell embedding and strong interpretation, contributing to comprehensive scRNA-seq data analysis. Our comprehensive evaluation across 20 scRNA-seq datasets demonstrates that scE2TM achieves significant clustering performance gains compared to 7 state-of-the-art methods. In addition, we propose a new interpretability evaluation benchmark that introduces 10 metrics to quantitatively assess the interpretability of single-cell embedded topic models. The results show that the interpretation provided by scE2TM performs encouragingly in terms of diversity and consistency with the underlying biological signals, contributing to a better revealing of the underlying biological mechanisms.
Related papers
- Information-theoretic Quantification of High-order Feature Effects in Classification Problems [0.19791587637442676]
We present an information-theoretic extension of the High-order interactions for Feature importance (Hi-Fi) method.<n>Our framework decomposes feature contributions into unique, synergistic, and redundant components.<n>Results indicate that the proposed estimator accurately recovers theoretical and expected findings.
arXiv Detail & Related papers (2025-07-06T11:50:30Z) - CellVerse: Do Large Language Models Really Understand Cell Biology? [74.34984441715517]
We introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data.<n>We systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse.
arXiv Detail & Related papers (2025-05-09T06:47:23Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Benchmarking Transcriptomics Foundation Models for Perturbation Analysis : one PCA still rules them all [1.507700065820919]
Recent advancements in transcriptomics sequencing provide new opportunities to uncover valuable insights.
No benchmark has been made to robustly evaluate the effectiveness of these rising models for perturbation analysis.
This article presents a novel biologically motivated evaluation framework and a hierarchy of perturbation analysis tasks.
arXiv Detail & Related papers (2024-10-17T18:27:51Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - Interpretable deep learning in single-cell omics [5.0934939727101565]
We introduce the basics of single-cell omics technologies and the concept of interpretable deep learning.
This is followed by a review of the recent interpretable deep learning models applied to various single-cell omics research.
arXiv Detail & Related papers (2024-01-11T23:59:37Z) - Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction.
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
arXiv Detail & Related papers (2023-12-15T14:04:23Z) - Regression-Based Analysis of Multimodal Single-Cell Data Integration
Strategies [0.0]
Multimodal single-cell technologies enable the simultaneous collection of diverse data types from individual cells.
This study highlights the exceptional performance of Echo State Networks, boasting a remarkable correlation score of 0.94.
These findings hold promise for advancing comprehension of cellular differentiation and function, leveraging the potential of Machine Learning.
arXiv Detail & Related papers (2023-11-21T16:31:27Z) - Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most of current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z) - Out of Distribution Generalization via Interventional Style Transfer in
Single-Cell Microscopy [1.7778546320705952]
Real-world deployment of computer vision systems requires causal representations that are invariant to contextual nuisances.
We propose tests to assess the extent to which models learn causal representations across increasingly challenging levels of OOD-generalization.
We show that despite seemingly strong performance, as assessed by other established metrics, both naive and contemporary baselines designed to ward against confounding, collapse on these tests.
arXiv Detail & Related papers (2023-06-15T20:08:16Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in
Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.