ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data
- URL: http://arxiv.org/abs/2505.12638v2
- Date: Tue, 20 May 2025 02:40:30 GMT
- Title: ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data
- Authors: Yifeng Jiao, Yuchen Liu, Yu Zhang, Xin Guo, Yushuai Wu, Chen Jiang, Jiyang Li, Hongwei Zhang, Limei Han, Xin Gao, Yuan Qi, Yuan Cheng,
- Abstract summary: Single-cell Assay for Transposase- Chromatin using sequencing (scATAC-eq)<n>ChromFound is a foundation model tailored for scATAC-eq.
- Score: 24.68667275455985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.
Related papers
- Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions.<n>We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way.<n>We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z) - scMamba: A Scalable Foundation Model for Single-Cell Multi-Omics Integration Beyond Highly Variable Feature Selection [5.139014238424409]
scMamba is a model designed to integrate single-cell multi-omics data without the need for prior feature selection.<n> scMamba distills rich biological insights from high-dimensional, sparse single-cell multi-omics data.<n>Our findings position scMamba as a powerful tool for large-scale single-cell multi-omics integration.
arXiv Detail & Related papers (2025-06-25T12:58:01Z) - An Inclusive Foundation Model for Generalizable Cytogenetics in Precision Oncology [18.252994255843813]
CHROMA is a foundation model for cytogenomics designed to overcome challenges by learning generalizable representations of chromosomal abnormalities.<n>It is pre-trained on over 84,000 specimens (4 million chromosomal images) via self-supervised learning.<n>CHROMA offers a scalable and generalizable solution for reliable and automated clinical analysis.
arXiv Detail & Related papers (2025-05-21T12:03:37Z) - UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion [61.690978792873196]
Existing approaches rely on either autoregressive sequence models or diffusion models.<n>We propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models.<n>We validate the effectiveness of UniGenX on material and small molecule generation tasks.
arXiv Detail & Related papers (2025-03-09T16:43:07Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - scReader: Prompting Large Language Models to Interpret scRNA-seq Data [12.767105992391555]
We propose an innovative hybrid approach that integrates the general knowledge capabilities of large language models with domain-specific representation models for single-cell omics data interpretation.<n>By inputting single-cell gene-level expression data with prompts, we effectively model cellular representations based on the differential expression levels of genes across various species and cell types.
arXiv Detail & Related papers (2024-12-24T04:28:42Z) - Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data.<n>CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - Scalable Amortized GPLVMs for Single Cell Transcriptomics Data [9.010523724015398]
Dimensionality reduction is crucial for analyzing large-scale single-cell RNA-seq data.
We introduce an improved model, the amortized variational model (BGPLVM)
BGPLVM is tailored for single-cell RNA-seq with specialized encoder, kernel, and likelihood designs.
arXiv Detail & Related papers (2024-05-06T21:54:38Z) - sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures [0.9674145073701153]
sc-OTGM is an unsupervised model grounded in the inductive bias that the scRNAseq data can be generated.
sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification.
It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
arXiv Detail & Related papers (2024-05-06T06:46:11Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Mixed Models with Multiple Instance Learning [51.440557223100164]
We introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL)
Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets.
arXiv Detail & Related papers (2023-11-04T16:42:42Z) - Masked conditional variational autoencoders for chromosome straightening [14.665481276886194]
Karyotyping is of importance for detecting chromosomal aberrations in human disease.
chromosomes easily appear curved in microscopic images, which prevents cytogeneticists from analyzing chromosome types.
We propose a framework for chromosome straightening, which comprises a preliminary processing algorithm and a generative model.
arXiv Detail & Related papers (2023-06-25T05:11:41Z) - A biology-driven deep generative model for cell-type annotation in
cytometry [0.0]
We introduce Scyan, a Single-cell Cytometry Network that automatically annotates cell types using only prior expert knowledge.
Scyan significantly outperforms the related state-of-the-art models on multiple public datasets while being faster and interpretable.
In addition, Scyan overcomes several complementary tasks such as batch-effect removal, debarcoding, and population discovery.
arXiv Detail & Related papers (2022-08-11T10:50:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.