scDD: Latent Codes Based scRNA-seq Dataset Distillation with Foundation Model Knowledge
- URL: http://arxiv.org/abs/2503.04357v1
- Date: Thu, 06 Mar 2025 12:01:20 GMT
- Title: scDD: Latent Codes Based scRNA-seq Dataset Distillation with Foundation Model Knowledge
- Authors: Zhen Yu, Jianan Han, Yang Liu, Qingchao Chen
- Abstract summary: Single-cell RNA sequencing (scRNA-seq) has profiled hundreds of millions of human cells across organs, diseases, development and perturbations to date. High-dimensional sparsity, batch effect noise, category imbalance, and ever-increasing data scale pose challenges for multi-center knowledge transfer, data fusion, and cross-validation. We propose a latent codes-based scRNA-seq dataset distillation framework named scDD, which distills foundation model knowledge and original dataset information into a compact latent space. We also propose a single-step conditional diffusion generator named SCDG, which performs single-step
- Score: 14.12713117447183
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single-cell RNA sequencing (scRNA-seq) technology has profiled hundreds of millions of human cells across organs, diseases, development and perturbations to date. However, the high-dimensional sparsity, batch effect noise, category imbalance, and ever-increasing data scale of the original sequencing data pose significant challenges for multi-center knowledge transfer, data fusion, and cross-validation between scRNA-seq datasets. To address these barriers, (1) we first propose a latent codes-based scRNA-seq dataset distillation framework named scDD, which transfers and distills foundation model knowledge and original dataset information into a compact latent space and generates a synthetic scRNA-seq dataset with a generator to replace the original dataset. Then, (2) we propose a single-step conditional diffusion generator named SCDG, which performs single-step gradient back-propagation to help scDD optimize distillation quality and avoid the gradient decay caused by multi-step back-propagation. Meanwhile, SCDG preserves the scRNA-seq data characteristics and inter-class discriminability of the synthetic dataset through flexible conditional control and generation quality assurance. Finally, we propose a comprehensive benchmark to evaluate the performance of scRNA-seq dataset distillation in different data analysis tasks. Our proposed method achieves a 7.61% absolute and 15.70% relative improvement over previous state-of-the-art methods, averaged across tasks.
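The core idea in the abstract, learning compact latent codes that a generator decodes into a synthetic dataset, can be illustrated with a deliberately tiny toy. This is a sketch under stated assumptions, not the scDD or SCDG implementation: the generator here is a frozen linear map `W` rather than a conditional diffusion model, the loss is plain class-mean matching (a swapped-in stand-in for the paper's distillation objective), no foundation model is involved, and all names (`distill_latent_codes`, `matvec`, `loss`) and numbers are hypothetical. What it does share with the description above is that each update back-propagates through exactly one generator call.

```python
import random


def matvec(W, z):
    """Return W @ z for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, z)) for row in W]


def loss(code, mu, W):
    """Squared error between the generated profile W @ code and the target mu."""
    return sum((x - m) ** 2 for x, m in zip(matvec(W, code), mu))


def distill_latent_codes(class_means, W, steps=200, lr=0.05, seed=0):
    """Learn one latent code per class for a frozen linear generator x = W z.

    Toy assumption: the distillation target is each class's mean expression
    profile, and the generator weights W are never updated.
    """
    rng = random.Random(seed)
    k = len(W[0])  # latent dimension
    codes = {c: [rng.gauss(0.0, 0.1) for _ in range(k)] for c in class_means}
    for _ in range(steps):
        for c, mu in class_means.items():
            z = codes[c]
            residual = [x - m for x, m in zip(matvec(W, z), mu)]
            # Gradient of ||W z - mu||^2 w.r.t. z is 2 W^T (W z - mu);
            # it passes through a single generator call.
            grad = [
                2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(k)
            ]
            codes[c] = [zj - lr * g for zj, g in zip(z, grad)]
    return codes


# Hypothetical example: 4 "genes", 2 latent dimensions, 2 cell types.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
class_means = {"type_a": [2.0, 1.0, 3.0, 1.5], "type_b": [0.0, 3.0, 3.0, 1.5]}
codes = distill_latent_codes(class_means, W)
for name, code in codes.items():
    print(name, [round(v, 3) for v in code], loss(code, class_means[name], W))
```

In this toy the distilled dataset is just two 2-dimensional codes plus the shared generator, which is the sense in which a latent-code representation can be far smaller than the data it stands in for.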
Related papers
- Benchmarking and optimizing organism wide single-cell RNA alignment methods [0.0]
We introduce the K-Neighbors Intersection (KNI) score, a single score that both penalizes batch effects and measures accuracy at cross-dataset cell-type label prediction.
We introduce Batch Adversarial single-cell Variational Inference (BA-scVI), as a new variant of scVI that uses adversarial training to penalize batch-effects in the encoder and decoder.
In the resulting aligned space, we find that the granularity of cell-type groupings is conserved, supporting the notion that whole-organism cell-type maps can be created by a single model without loss of information.
arXiv Detail & Related papers (2025-03-26T17:11:47Z)
- Enhanced ECG Arrhythmia Detection Accuracy by Optimizing Divergence-Based Data Fusion [5.575308369829893]
We propose a feature-based fusion algorithm utilizing Kernel Density Estimation (KDE) and Kullback-Leibler (KL) divergence.
Using our in-house datasets consisting of ECG signals collected from 2000 healthy and 2000 diseased individuals, we verify our method by using the publicly available PTB-XL dataset.
The results demonstrate that the proposed fusion method significantly enhances feature-based classification accuracy for abnormal ECG cases in the merged datasets.
arXiv Detail & Related papers (2025-03-19T12:16:48Z)
- Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
arXiv Detail & Related papers (2024-08-22T15:20:32Z)
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- scRDiT: Generating single-cell RNA-seq data by diffusion transformers and accelerating sampling [9.013834280011293]
Single-cell RNA sequencing (scRNA-seq) is a groundbreaking technology extensively utilized in biological research.
Our study introduces a generative approach termed scRNA-seq Diffusion Transformer (scRDiT).
This method generates virtual scRNA-seq data by leveraging a real dataset.
arXiv Detail & Related papers (2024-04-09T09:25:16Z)
- SIRST-5K: Exploring Massive Negatives Synthesis with Self-supervised Learning for Robust Infrared Small Target Detection [53.19618419772467]
Single-frame infrared small target (SIRST) detection aims to recognize small targets from clutter backgrounds.
With the development of Transformer, the scale of SIRST models is constantly increasing.
With a rich diversity of infrared small target data, our algorithm significantly improves the model performance and convergence speed.
arXiv Detail & Related papers (2024-03-08T16:14:54Z)
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- scDiffusion: conditional generation of high-quality single-cell data using diffusion model [1.0738561302102216]
Single-cell RNA sequencing (scRNA-seq) data are important for studying the laws of life at single-cell level.
It is still challenging to obtain enough high-quality scRNA-seq data.
We developed scDiffusion, a generative model combining diffusion model and foundation model to generate high-quality scRNA-seq data.
arXiv Detail & Related papers (2024-01-08T15:44:39Z)
- ScRAE: Deterministic Regularized Autoencoders with Flexible Priors for Clustering Single-cell Gene Expression Data [11.511172015076532]
Clustering single-cell RNA sequence (scRNA-seq) data poses statistical and computational challenges.
Regularized Auto-Encoder (RAE) based deep neural network models have achieved remarkable success in learning robust low-dimensional representations.
We propose a modified RAE framework (called the scRAE) for effective clustering of the single-cell RNA sequencing data.
arXiv Detail & Related papers (2021-07-16T05:13:31Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Approximate kNN Classification for Biomedical Data [1.1852406625172218]
Single-cell RNA-seq (scRNA-seq) is an emerging DNA sequencing technology with promising capabilities but significant computational challenges.
We propose the utilization of approximate nearest neighbor search algorithms for the task of kNN classification in scRNA-seq data.
arXiv Detail & Related papers (2020-12-03T18:30:43Z)
- TadGAN: Time Series Anomaly Detection Using Generative Adversarial Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs).
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z)