scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis
- URL: http://arxiv.org/abs/2509.25884v2
- Date: Mon, 10 Nov 2025 03:55:13 GMT
- Title: scUnified: An AI-Ready Standardized Resource for Single-Cell RNA Sequencing Analysis
- Authors: Ping Xu, Zaitian Wang, Zhirui Wang, Pengjiang Li, Ran Zhang, Gaoyang Li, Hanyu Xie, Jiajia Wang, Yuanchun Zhou, Pengfei Wang,
- Abstract summary: We present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data.<n> scUnified consolidates 13 high-quality datasets spanning two species and nine tissue types.
- Score: 23.973638982075016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single-cell RNA sequencing (scRNA-seq) technology enables systematic delineation of cellular states and interactions, providing crucial insights into cellular heterogeneity. Building on this potential, numerous computational methods have been developed for tasks such as cell clustering, cell type annotation, and marker gene identification. To fully assess and compare these methods, standardized, analysis-ready datasets are essential. However, such datasets remain scarce, and variations in data formats, preprocessing workflows, and annotation strategies hinder reproducibility and complicate systematic evaluation of existing methods. To address these challenges, we present scUnified, an AI-ready standardized resource for single-cell RNA sequencing data that consolidates 13 high-quality datasets spanning two species (human and mouse) and nine tissue types. All datasets undergo standardized quality control and preprocessing and are stored in a uniform format to enable direct application in diverse computational analyses without additional data cleaning. We further demonstrate the utility of scUnified through experimental analyses of representative biological tasks, providing a reproducible foundation for the standardized evaluation of computational methods on a unified dataset.
Related papers
- Parameter-free representations outperform single-cell foundation models on downstream benchmarks [0.0]
Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure.<n>Large-scale foundation models, such as TranscriptFormer, learn a generative model for gene expression by embedding genes into a latent vector space.<n>We ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations.
arXiv Detail & Related papers (2026-02-18T18:42:29Z) - scCluBench: Comprehensive Benchmarking of Clustering Algorithms for Single-Cell RNA Sequencing [24.35296206096082]
scCluBench is a comprehensive benchmark of clustering algorithms for scRNA-seq data.<n>First, scCluBench provides 36 scRNA-seq datasets collected from diverse public sources.
arXiv Detail & Related papers (2025-12-02T07:04:38Z) - scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration [53.683726781791385]
We introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration.<n>Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation.
arXiv Detail & Related papers (2025-10-28T21:28:39Z) - DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models [0.0]
Generative AI foundation models offer transformative potential for processing structured biological data.<n>We propose the use of agentic foundation models with real-time web search to automate the labeling of experimental data, achieving up to 82.5% accuracy.
arXiv Detail & Related papers (2025-06-14T23:30:22Z) - scDD: Latent Codes Based scRNA-seq Dataset Distillation with Foundation Model Knowledge [14.12713117447183]
Single-cell RNA sequencing (scRNA-seq) has profiled hundreds of millions of human cells across organs, diseases, development and perturbations to date.<n>High-dimensional sparsity, batch effect noise, category imbalance, and ever-increasing data scale pose challenges for multi-center knowledge transfer, data fusion, and cross-validation.<n>We propose a latent codes-based scRNA-seq dataset distillation framework named scDD, which distills foundation model knowledge and original dataset information into a compact latent space.<n>We also propose a single-step conditional diffusion generator named SCDG, which perform single-step
arXiv Detail & Related papers (2025-03-06T12:01:20Z) - Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data [13.56585855722118]
Large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract biological knowledge.<n>Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data.<n>The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-12-03T23:58:35Z) - Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data.<n>CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark BEACON (textbfBEnchmtextbfArk for textbfCOmprehensive RtextbfNA Task and Language Models).<n>First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications.<n>Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models.<n>Third, we investigate the vital RNA language model components
arXiv Detail & Related papers (2024-06-14T19:39:19Z) - UniCell: Universal Cell Nucleus Classification via Prompt Learning [76.11864242047074]
We propose a universal cell nucleus classification framework (UniCell)
It employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains.
In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets.
arXiv Detail & Related papers (2024-02-20T11:50:27Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Is your data alignable? Principled and interpretable alignability
testing and integration of single-cell data [24.457344926393397]
Single-cell data integration can provide a comprehensive molecular view of cells.
Existing methods suffer from several fundamental limitations.
We present a spectral manifold alignment and inference framework.
arXiv Detail & Related papers (2023-08-03T16:04:14Z) - Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [44.97217246897902]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
arXiv Detail & Related papers (2023-07-13T15:08:44Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in
Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z) - Review of Single-cell RNA-seq Data Clustering for Cell Type
Identification and Characterization [12.655970720359297]
Unsupervised learning has become the central component to identify and characterize novel cell types and gene expression patterns.
We review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations.
We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on two single-cell transcriptomic datasets.
arXiv Detail & Related papers (2020-01-03T22:48:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.