Detecting Batch Heterogeneity via Likelihood Clustering
- URL: http://arxiv.org/abs/2601.09758v1
- Date: Wed, 14 Jan 2026 01:49:21 GMT
- Title: Detecting Batch Heterogeneity via Likelihood Clustering
- Authors: Austin Talbot, Yue Ke
- Abstract summary: Batch effects represent a major confounder in genomic diagnostics. We introduce a method that addresses both limitations by clustering samples according to their Bayesian model evidence. Our method achieves superior clustering accuracy compared to standard correlation-based and dimensionality-reduction approaches.
- Score: 0.9668407688201359
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Batch effects represent a major confounder in genomic diagnostics. In copy number variant (CNV) detection from NGS, many algorithms compare read depth between test samples and a reference sample, assuming they are process-matched. When this assumption is violated, with causes ranging from reagent lot changes to multi-site processing, the reference becomes inappropriate, introducing false CNV calls or masking true pathogenic variants. Detecting such heterogeneity before downstream analysis is critical for reliable clinical interpretation. Existing batch effect detection methods either cluster samples based on raw features, risking conflation of biological signal with technical variation, or require known batch labels that are frequently unavailable. We introduce a method that addresses both limitations by clustering samples according to their Bayesian model evidence. The central insight is that evidence quantifies compatibility between data and model assumptions: technical artifacts violate those assumptions and reduce evidence, whereas biological variation, including CNV status, is anticipated by the model and yields high evidence. This asymmetry provides a discriminative signal that separates batch effects from biology. We formalize heterogeneity detection as a likelihood ratio test for mixture structure in evidence space, using parametric bootstrap calibration to ensure conservative false positive rates. We validate our approach on synthetic data demonstrating proper Type I error control, on three clinical targeted sequencing panels (liquid biopsy, BRCA, and thalassemia) exhibiting distinct batch effect mechanisms, and on mouse electrophysiology recordings demonstrating cross-modality generalization. Our method achieves superior clustering accuracy compared to standard correlation-based and dimensionality-reduction approaches while maintaining the conservativeness required for clinical usage.
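To make the testing idea in the abstract concrete, here is a minimal sketch of a likelihood ratio test for mixture structure with parametric bootstrap calibration. It is not the authors' implementation. It assumes each sample has already been reduced to one scalar log-evidence value, compares a single Gaussian against a two-component Gaussian mixture fitted by EM, and calibrates the statistic by simulating from the fitted single Gaussian. The function names, the Gaussian modelling choice, the variance floor, and the number of bootstrap replicates are all illustrative assumptions.

```python
import numpy as np


def mixture_loglik(x, w, mu, sigma):
    """Return the total log-likelihood and per-point log-responsibilities
    of 1D data x under a Gaussian mixture with weights w."""
    log_comp = (np.log(w) - 0.5 * np.log(2 * np.pi * sigma**2)
                - 0.5 * ((x[:, None] - mu) / sigma) ** 2)
    top = log_comp.max(axis=1, keepdims=True)  # log-sum-exp for stability
    log_norm = top + np.log(np.exp(log_comp - top).sum(axis=1, keepdims=True))
    return log_norm.sum(), log_comp - log_norm


def lrt_statistic(x, n_iter=100):
    """2 * (log-likelihood of two-component fit - log-likelihood of one-component fit)."""
    x = np.asarray(x, dtype=float)
    # Assumed variance floor: keeps a component from collapsing onto a few points.
    floor = 0.1 * x.std()
    ll1, _ = mixture_loglik(x, np.array([1.0]),
                            np.array([x.mean()]), np.array([x.std()]))
    # Initialise EM by splitting the samples at the median.
    med = np.median(x)
    low, high = x[x <= med], x[x > med]
    w = np.array([0.5, 0.5])
    mu = np.array([low.mean(), high.mean()])
    sigma = np.maximum(np.array([low.std(), high.std()]), floor)
    for _ in range(n_iter):
        _, log_resp = mixture_loglik(x, w, mu, sigma)  # E-step
        resp = np.exp(log_resp)
        nk = resp.sum(axis=0) + 1e-12                  # M-step
        w = nk / nk.sum()
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.maximum(np.sqrt(var), floor)
    ll2, _ = mixture_loglik(x, w, mu, sigma)
    return 2.0 * (ll2 - ll1)


def bootstrap_pvalue(x, n_boot=99, seed=0):
    """Calibrate the LRT statistic against data simulated from the fitted
    single-Gaussian (homogeneous) model."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    observed = lrt_statistic(x)
    null = [lrt_statistic(rng.normal(x.mean(), x.std(), size=x.size))
            for _ in range(n_boot)]
    # The add-one correction keeps the p-value strictly positive.
    p_value = (1 + sum(s >= observed for s in null)) / (n_boot + 1)
    return observed, p_value


# Hypothetical log-evidence values for two batches that fit the model differently.
rng = np.random.default_rng(1)
log_evidence = np.concatenate([rng.normal(-100, 1, 30), rng.normal(-120, 1, 30)])
stat, p = bootstrap_pvalue(log_evidence)
print(f"LRT statistic = {stat:.1f}, bootstrap p-value = {p:.3f}")
```

Bootstrap calibration is used here because the usual chi-squared approximation for the likelihood ratio statistic is unreliable when testing the number of mixture components. Simulating from the fitted null model gives an empirical reference distribution instead.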
Related papers
- Controllable Generative Sandbox for Causal Inference [9.416664327739516]
CausalMix is a variational generative framework for causal inference. It achieves state-of-the-art distributional metrics on mixed-type tables while providing stable, fine-grained causal control. We demonstrate practical utility in a comparative safety study of metastatic castration-resistant prostate cancer treatments.
arXiv Detail & Related papers (2026-03-03T23:37:05Z) - Suppressing Prior-Comparison Hallucinations in Radiology Report Generation via Semantically Decoupled Latent Steering [94.37535002230504]
We develop a training-free, inference-time control framework termed Semantically Decoupled Latent Steering. Our approach constructs a semantic-free intervention vector via large language model (LLM)-driven semantic decomposition. We show that our approach significantly reduces the probability of historical hallucinations.
arXiv Detail & Related papers (2026-02-27T04:49:01Z) - Unlasting: Unpaired Single-Cell Multi-Perturbation Estimation by Dual Conditional Diffusion Implicit Bridges [68.98973318553983]
We propose a framework based on Dual Diffusion Implicit Bridges (DDIB) to learn the mapping between different data distributions. We integrate gene regulatory network (GRN) information to propagate perturbation signals in a biologically meaningful way. We also incorporate a masking mechanism to predict silent genes, improving the quality of generated profiles.
arXiv Detail & Related papers (2025-06-26T09:05:38Z) - A Robust Support Vector Machine Approach for Raman COVID-19 Data Classification [0.7864304771129751]
In this paper, we investigate the performance of a novel robust formulation for Support Vector Machine (SVM) in classifying COVID-19 samples obtained from Raman spectroscopy. We derive robust counterpart models of deterministic formulations using bounded-by-norm uncertainty sets around each observation. The effectiveness of our approach is validated on real-world COVID-19 datasets provided by Italian hospitals.
arXiv Detail & Related papers (2025-01-29T14:02:45Z) - scMEDAL for the interpretable analysis of single-cell transcriptomics data with batch effect visualization using a deep mixed effects autoencoder [3.194381706244149]
We propose scMEDAL, a single-cell Mixed Effects Deep Autoencoder Learning framework. scMEDAL models batch-invariant and batch-specific effects using two complementary networks. scMEDAL produces interpretable, batch-specific embeddings that complement both scMEDAL-FE and established correction methods.
arXiv Detail & Related papers (2024-11-11T00:10:48Z) - Unlearnable Examples Detection via Iterative Filtering [84.59070204221366]
Deep neural networks are proven to be vulnerable to data poisoning attacks.
It is quite beneficial and challenging to detect poisoned samples from a mixed dataset.
We propose an Iterative Filtering approach for UEs identification.
arXiv Detail & Related papers (2024-08-15T13:26:13Z) - Conditionally Invariant Representation Learning for Disentangling Cellular Heterogeneity [25.488181126364186]
This paper presents a novel approach that leverages domain variability to learn representations that are conditionally invariant to unwanted variability or distractors.
We apply our method to grand biological challenges, such as data integration in single-cell genomics.
Specifically, the proposed approach helps to disentangle biological signals from data biases that are unrelated to the target task or the causal explanation of interest.
arXiv Detail & Related papers (2023-07-02T12:52:41Z) - Bootstrapped Edge Count Tests for Nonparametric Two-Sample Inference Under Heterogeneity [5.8010446129208155]
We develop a new nonparametric testing procedure that accurately detects differences between the two samples.
A comprehensive simulation study and an application to detecting user behaviors in online games demonstrate the excellent non-asymptotic performance of the proposed test.
arXiv Detail & Related papers (2023-04-26T22:25:44Z) - Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective [51.70661197256033]
We propose ARCO, a semi-supervised contrastive learning framework with stratified group theory for medical image segmentation.
We first propose building ARCO through the concept of variance-reduced estimation and show that certain variance-reduction techniques are particularly beneficial in pixel/voxel-level segmentation tasks.
We experimentally validate our approaches on eight benchmarks, i.e., five 2D/3D medical and three semantic segmentation datasets, with different label settings.
arXiv Detail & Related papers (2023-02-03T13:50:25Z) - Benchmarking common uncertainty estimation methods with histopathological images under domain shift and label noise [62.997667081978825]
In high-risk environments, deep learning models need to be able to judge their uncertainty and reject inputs when there is a significant chance of misclassification.
We conduct a rigorous evaluation of the most commonly used uncertainty and robustness methods for the classification of Whole Slide Images.
We observe that ensembles of methods generally lead to better uncertainty estimates as well as an increased robustness towards domain shifts and label noise.
arXiv Detail & Related papers (2023-01-03T11:34:36Z) - Hierarchical Semi-Supervised Contrastive Learning for Contamination-Resistant Anomaly Detection [81.07346419422605]
Anomaly detection aims at identifying deviant samples from the normal data distribution.
Contrastive learning has provided a successful way to sample representation that enables effective discrimination on anomalies.
We propose a novel hierarchical semi-supervised contrastive learning framework for contamination-resistant anomaly detection.
arXiv Detail & Related papers (2022-07-24T18:49:26Z) - Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.