CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis
- URL: http://arxiv.org/abs/2509.06986v1
- Date: Tue, 02 Sep 2025 03:30:07 GMT
- Title: CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis
- Authors: Cedric Caruzzo, Jong Chul Ye,
- Abstract summary: We introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology.<n>Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
- Score: 51.56484100374058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale biological discovery requires integrating massive, heterogeneous datasets like those from the JUMP Cell Painting consortium, but technical batch effects and a lack of generalizable models remain critical roadblocks. To address this, we introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology that are robust to batch effects. Unlike traditional methods that require retraining on new data, CellPainTR's design, featuring source-specific context tokens, allows for effective out-of-distribution (OOD) generalization to entirely unseen datasets without fine-tuning. We validate CellPainTR on the large-scale JUMP dataset, where it outperforms established methods like ComBat and Harmony in both batch integration and biological signal preservation. Critically, we demonstrate its robustness through a challenging OOD task on the unseen Bray et al. dataset, where it maintains high performance despite significant domain and feature shifts. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
Related papers
- DOGMA: Weaving Structural Information into Data-centric Single-cell Transcriptomics Analysis [43.565183518761984]
We propose DOGMA, a data-centric framework designed for the structural reshaping and semantic enhancement of raw data.<n>In complex multi-species and multi-organ benchmarks, DOGMA SOTA performance, exhibiting superior zero-shot robustness and sample efficiency.
arXiv Detail & Related papers (2026-02-02T09:10:09Z) - scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration [53.683726781791385]
We introduce a scalable and flexible generative framework called single-cell Multi-omics Regularized Disentangled Representations (scMRDR) for unpaired multi-omics integration.<n>Our method achieves excellent performance on benchmark datasets in terms of batch correction, modality alignment, and biological signal preservation.
arXiv Detail & Related papers (2025-10-28T21:28:39Z) - Comparative Analysis of Deep Learning Strategies for Hypertensive Retinopathy Detection from Fundus Images: From Scratch and Pre-trained Models [5.860609259063137]
This paper presents a comparative analysis of deep learning strategies for detecting hypertensive retinopathy from fundus images.<n>We investigate three distinct approaches: a custom CNN, a suite of pre-trained transformer-based models, and an AutoML solution.
arXiv Detail & Related papers (2025-06-14T13:11:33Z) - Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data.<n>CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - Deep models for stroke segmentation: do complex architectures always perform better? [1.4651272514940197]
Stroke segmentation plays a crucial role in the diagnosis and treatment of stroke patients.
Deep models have been introduced for general medical image segmentation.
In this study, we selected four types of deep models that were recently proposed and evaluated their performance for stroke segmentation.
arXiv Detail & Related papers (2024-03-25T20:44:01Z) - Cell Variational Information Bottleneck Network [6.164295534465283]
We propose a convolutional neural network using information bottleneck mechanism, which can be combined with the latest feedforward network architecture.
Cell Variational Information Bottleneck Network is constructed by stacking VIB cells, which generate feature maps with uncertainty.
In a more complex representation learning task, face recognition, our network structure has also achieved very competitive results.
arXiv Detail & Related papers (2024-03-22T10:06:31Z) - UniCell: Universal Cell Nucleus Classification via Prompt Learning [76.11864242047074]
We propose a universal cell nucleus classification framework (UniCell)
It employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains.
In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets.
arXiv Detail & Related papers (2024-02-20T11:50:27Z) - BioREx: Improving Biomedical Relation Extraction by Leveraging
Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z) - AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context
Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the celluar graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information to an extent that it can achieve the same performance with as low as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in
Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.