DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models
- URL: http://arxiv.org/abs/2506.13817v1
- Date: Sat, 14 Jun 2025 23:30:22 GMT
- Title: DeepSeq: High-Throughput Single-Cell RNA Sequencing Data Labeling via Web Search-Augmented Agentic Generative AI Foundation Models
- Authors: Saleem A. Al Dajani, Abel Sanchez, John R. Williams,
- Abstract summary: Generative AI foundation models offer transformative potential for processing structured biological data.<n>We propose the use of agentic foundation models with real-time web search to automate the labeling of experimental data, achieving up to 82.5% accuracy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative AI foundation models offer transformative potential for processing structured biological data, particularly in single-cell RNA sequencing, where datasets are rapidly scaling toward billions of cells. We propose the use of agentic foundation models with real-time web search to automate the labeling of experimental data, achieving up to 82.5% accuracy. This addresses a key bottleneck in supervised learning for structured omics data by increasing annotation throughput without manual curation and human error. Our approach enables the development of virtual cell foundation models capable of downstream tasks such as cell-typing and perturbation prediction. As data volume grows, these models may surpass human performance in labeling, paving the way for reliable inference in large-scale perturbation screens. This application demonstrates domain-specific innovation in health monitoring and diagnostics, aligned with efforts like the Human Cell Atlas and Human Tumor Atlas Network.
Related papers
- Private Training & Data Generation by Clustering Embeddings [74.00687214400021]
Differential privacy (DP) provides a robust framework for protecting individual data.<n>We introduce a novel principled method for DP synthetic image embedding generation.<n> Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
arXiv Detail & Related papers (2025-06-20T00:17:14Z) - Comparative Analysis of Deep Learning Strategies for Hypertensive Retinopathy Detection from Fundus Images: From Scratch and Pre-trained Models [5.860609259063137]
This paper presents a comparative analysis of deep learning strategies for detecting hypertensive retinopathy from fundus images.<n>We investigate three distinct approaches: a custom CNN, a suite of pre-trained transformer-based models, and an AutoML solution.
arXiv Detail & Related papers (2025-06-14T13:11:33Z) - MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets [2.8209943093430443]
This work is powered by complementary advances in core data standardization through the MEDS framework.
We dramatically simplify and accelerate this process of scalably featurizing irregularly sampled time-series data.
This system will greatly enhance the reliability, scalable, and ease of development of powerful ML solutions for health problems across diverse datasets and clinical settings.
arXiv Detail & Related papers (2024-10-31T20:36:37Z) - Data Quality Monitoring through Transfer Learning on Anomaly Detection for the Hadron Calorimeters [0.36070136675401654]
Transfer learning (TL) mechanisms promise to mitigate data sparsity and model complexity by utilizing pre-trained models for a new task.
We present the potential of TL within the context of anomaly detection for the Hadron Calorimeter of the Compact Muon Solenoid experiment at CERN.
Experiment results demonstrate that TL effectively enhances the model learning accuracy on a target subdetector.
arXiv Detail & Related papers (2024-08-29T15:19:06Z) - Large-Scale Targeted Cause Discovery with Data-Driven Learning [66.86881771339145]
We propose a novel machine learning approach for inferring causal variables of a target variable from observations.<n>By employing a local-inference strategy, our approach scales with linear complexity in the number of variables, efficiently scaling up to thousands of variables.<n> Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks.
arXiv Detail & Related papers (2024-08-29T02:21:11Z) - Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data.<n>CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - The Imaginative Generative Adversarial Network: Automatic Data
Augmentation for Dynamic Skeleton-Based Hand Gesture and Human Action
Recognition [27.795763107984286]
We present a novel automatic data augmentation model, which approximates the distribution of the input data and samples new data from this distribution.
Our results show that the augmentation strategy is fast to train and can improve classification accuracy for both neural networks and state-of-the-art methods.
arXiv Detail & Related papers (2021-05-27T11:07:09Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in
Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z) - A Systematic Approach to Featurization for Cancer Drug Sensitivity
Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.