Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts
- URL: http://arxiv.org/abs/2601.14969v1
- Date: Wed, 21 Jan 2026 13:15:27 GMT
- Title: Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts
- Authors: Yiyao Yang
- Abstract summary: We introduce a robustness framework to quantify performance degradation, calibration failures, and uncertainty-based reliability. In simulation, motif-driven regulatory outputs are generated with cell-type-specific programs, perturbations, GC bias, depth variation, batch effects, and heteroscedastic noise. Models remain accurate but show higher error, severe variance miscalibration, and coverage collapse under motif effect rewiring and noise-dominated regimes.
- Score: 0.3948325938742681
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Robust machine learning for regulatory genomics is studied under biologically and technically induced distribution shifts. Deep convolutional and attention-based models achieve strong in-distribution performance on DNA regulatory sequence prediction tasks but are usually evaluated under i.i.d. assumptions, even though real applications involve cell-type-specific programs, evolutionary turnover, assay protocol changes, and sequencing artifacts. We introduce a robustness framework that combines a mechanistic simulation benchmark with real-data analysis on a massively parallel reporter assay (MPRA) dataset to quantify performance degradation, calibration failures, and uncertainty-based reliability. In simulation, motif-driven regulatory outputs are generated with cell-type-specific programs, PWM perturbations, GC bias, depth variation, batch effects, and heteroscedastic noise, and CNN, BiLSTM, and transformer models are evaluated. Models remain accurate and reasonably calibrated under mild GC-content shifts but show higher error, severe variance miscalibration, and coverage collapse under motif effect rewiring and noise-dominated regimes, revealing robustness gaps invisible to standard i.i.d. evaluation. Adding simple biological structural priors (motif-derived features in simulation and global GC content in MPRA) improves in-distribution error and yields consistent robustness gains under biologically meaningful genomic shifts, while providing only limited protection against strong assay noise. Uncertainty-aware selective prediction offers an additional safety layer: risk-coverage analyses on simulated and MPRA data show that filtering low-confidence inputs recovers low-risk subsets, including under GC-based out-of-distribution conditions, although reliability gains diminish when noise dominates.
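The risk-coverage analysis mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `risk_coverage_curve`, the use of squared error as the risk, and the synthetic heteroscedastic data are all assumptions made for illustration. The idea is to rank inputs by predicted uncertainty, keep only the most confident fraction (the coverage), and report the mean error on the retained subset (the risk).

```python
import numpy as np


def risk_coverage_curve(errors, uncertainties):
    """Return (coverage, risk) arrays for selective prediction.

    Hypothetical helper, not taken from the paper. Inputs are ranked from
    most to least confident (lowest uncertainty first); at each coverage
    level the risk is the mean error over the retained inputs.
    """
    errors = np.asarray(errors, dtype=float)
    uncertainties = np.asarray(uncertainties, dtype=float)
    order = np.argsort(uncertainties, kind="stable")  # most confident first
    sorted_errors = errors[order]
    kept = np.arange(1, len(sorted_errors) + 1)       # number of inputs retained
    coverage = kept / len(sorted_errors)
    risk = np.cumsum(sorted_errors) / kept            # running mean error
    return coverage, risk


# Synthetic demo (assumed data): heteroscedastic noise whose scale the
# model is assumed to predict perfectly, so uncertainty is informative.
rng = np.random.default_rng(0)
sigma = rng.uniform(0.1, 2.0, size=1000)   # assumed predicted standard deviation
residual = rng.normal(0.0, sigma)          # prediction error drawn at that scale
coverage, risk = risk_coverage_curve(residual ** 2, sigma)
```

When the uncertainty estimate tracks the true error, risk falls as coverage is reduced. When noise dominates or the variance is miscalibrated, the curve flattens, which matches the diminishing reliability gains the abstract describes.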
Related papers
- scDFM: Distributional Flow Matching Model for Robust Single-Cell Perturbation Prediction [12.48933770510505]
We present scDFM, a generative framework based on conditional flow matching. scDFM aligns perturbed and control populations beyond cell-level correspondences.
arXiv Detail & Related papers (2026-02-06T17:00:21Z)
- Latent Causal Diffusions for Single-Cell Perturbation Modeling [83.47931153555321]
We present a generative model that frames single-cell gene expression as a stationary diffusion process observed under measurement noise. LCD outperforms established approaches in predicting the distributional shifts of unseen perturbation combinations in single-cell RNA-sequencing screens. We develop an approach we call causal linearization via perturbation responses (CLIPR), which yields an approximation of the direct causal effects between all genes.
arXiv Detail & Related papers (2026-01-20T16:15:38Z)
- Noisy Analysis of Quantum SMOTE on Condition Monitoring and Fault Classification in Industrial and Energy Systems [0.5505634045241289]
Imbalanced machine learning models are a fundamental issue in industrial condition monitoring and fault classification pipelines. This work presents a detailed benchmarking and investigation of classical classifiers under class imbalance mitigation. The results show that QSMOTE consistently corrects distributional skew and significantly enhances the performance of non-linear classifiers.
arXiv Detail & Related papers (2026-01-16T16:44:38Z)
- SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for model collapse. By utilizing benchmarks that deriving and deterministic bounds on the matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards states, offering both theoretical insights into the mechanics of collapse.
arXiv Detail & Related papers (2026-01-06T19:47:11Z)
- Locally Adaptive Conformal Inference for Operator Models [5.78532405664684]
We introduce Local Sliced Conformal Inference (LSCI), a distribution-free framework for generating function-valued locally adaptive prediction sets for operator models. We prove finite-sample validity and derive a data-dependent upper bound on the coverage gap under local exchangeability. We empirically demonstrate spaces against biased predictions and certain out-of-distribution noise regimes.
arXiv Detail & Related papers (2025-07-28T16:37:56Z)
- Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection [53.137651284042434]
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples limits the effectiveness of existing methods. We propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA generates realistic, diverse, and semantically aligned anomalies using only a small number of samples.
arXiv Detail & Related papers (2025-07-13T12:56:59Z)
- DISPROTBENCH: A Disorder-Aware, Task-Rich Benchmark for Evaluating Protein Structure Prediction in Realistic Biological Contexts [76.59606029593085]
DisProtBench is a benchmark for evaluating protein structure prediction models (PSPMs) under structural disorder and complex biological conditions. DisProtBench spans three key axes: data complexity, task diversity, and interpretability. Results reveal significant variability in model robustness under disorder, with low-confidence regions linked to functional prediction failures.
arXiv Detail & Related papers (2025-06-18T23:58:22Z)
- Statistical Management of the False Discovery Rate in Medical Instance Segmentation Based on Conformal Risk Control [2.4578723416255754]
Instance segmentation plays a pivotal role in medical image analysis by enabling precise localization and delineation of lesions, tumors, and anatomical structures. Deep learning models such as Mask R-CNN and BlendMask have achieved remarkable progress, but their application in high-risk medical scenarios remains constrained by confidence calibration issues. We propose a robust quality control framework based on conformal prediction theory to address this challenge.
arXiv Detail & Related papers (2025-04-06T13:31:19Z)
- Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z)
- Generative Principal Component Regression via Variational Inference [2.4415762506639944]
One approach to designing appropriate manipulations is to target key features of predictive models.
We develop a novel objective based on supervised variational autoencoders (SVAEs) that enforces such information is represented in the latent space.
We show in simulations that gPCR dramatically improves target selection in manipulation as compared to standard PCR and SVAEs.
arXiv Detail & Related papers (2024-09-03T22:38:55Z)
- Gaussian Process-based Min-norm Stabilizing Controller for Control-Affine Systems with Uncertain Input Effects and Dynamics [90.81186513537777]
We propose a novel compound kernel that captures the control-affine nature of the problem.
We show that this resulting optimization problem is convex, and we call it Gaussian Process-based Control Lyapunov Function Second-Order Cone Program (GP-CLF-SOCP).
arXiv Detail & Related papers (2020-11-14T01:27:32Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Empirical optimization is central to modern machine learning, but its role in its success is still unclear.
We show that it commonly arises in parameters of discrete multiplicative noise due to variance.
A detailed analysis is conducted in which we describe on key factors, including recent step size, and data, all exhibit similar results on state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)