Efficient Data Selection for Training Genomic Perturbation Models
- URL: http://arxiv.org/abs/2503.14571v5
- Date: Wed, 06 Aug 2025 07:22:08 GMT
- Title: Efficient Data Selection for Training Genomic Perturbation Models
- Authors: George Panagopoulos, Johannes F. Lutzeyer, Sofiane Ennadir, Jun Pang
- Abstract summary: Gene perturbation models based on graph neural networks are trained to predict the outcomes of gene perturbations. Active learning is often employed to train these models, alternating between wet-lab experiments and model updates. We propose a graph-based data filtering method that selects the gene perturbations in one shot and in a model-free manner.
- Score: 8.362190332905524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Genomic studies, including CRISPR-based Perturb-seq analyses, face a vast hypothesis space, while gene perturbations remain costly and time-consuming. Gene perturbation models based on graph neural networks are trained to predict the outcomes of gene perturbations to facilitate such experiments. Due to the cost of genomic experiments, active learning is often employed to train these models, alternating between wet-lab experiments and model updates. However, the operational constraints of the wet-lab and the iterative nature of active learning significantly increase the total training time. Furthermore, the inherent sensitivity to model initialization can lead to markedly different sets of gene perturbations across runs, which undermines the reproducibility, interpretability, and reusability of the method. To this end, we propose a graph-based data filtering method that, unlike active learning, selects the gene perturbations in one shot and in a model-free manner. The method optimizes a criterion that maximizes the supervision signal from the graph neural network to enhance generalization. The criterion is defined over the input graph and is optimized with submodular maximization. We compare it empirically to active learning, and the results demonstrate that despite yielding months of acceleration, it also improves the stability of the selected perturbation experiments while achieving comparable test error.
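The abstract leaves the exact criterion unspecified, but one-shot, model-free selection with a submodular objective is typically realized via greedy maximization, which enjoys the classic (1 - 1/e) approximation guarantee for monotone submodular functions. Below is a minimal sketch under that reading, using an illustrative facility-location objective over a gene-gene similarity matrix; the objective, the `similarity` input, and the budget `k` are assumptions for illustration, not the paper's actual criterion.

```python
import numpy as np

def greedy_submodular_selection(similarity: np.ndarray, k: int) -> list[int]:
    """Greedily pick k genes maximizing the facility-location objective
    F(S) = sum_v max_{s in S} similarity[v, s], a monotone submodular
    function, so greedy selection is within (1 - 1/e) of optimal."""
    n = similarity.shape[0]
    selected: list[int] = []
    best_cover = np.zeros(n)  # best_cover[v]: how well gene v is covered so far
    for _ in range(k):
        # Marginal gain of adding each candidate column j as a new "facility"
        gains = np.maximum(similarity, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[selected] = -np.inf  # never re-select a chosen gene
        best = int(np.argmax(gains))
        selected.append(best)
        best_cover = np.maximum(best_cover, similarity[:, best])
    return selected

# Toy usage: a symmetric similarity matrix standing in for a gene-gene graph
rng = np.random.default_rng(0)
A = rng.random((50, 50))
print(greedy_submodular_selection((A + A.T) / 2, k=5))
```

For genome-scale graphs, a lazy-greedy variant with a priority queue returns the same selections far faster, since marginal gains only shrink as the set grows.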
Related papers
- Modeling Gene Expression Distributional Shifts for Unseen Genetic Perturbations [44.619690829431214]
We train a neural network to predict distributional responses in gene expression following genetic perturbations. Our model predicts gene-level histograms conditioned on perturbations and outperforms baselines in capturing higher-order statistics.
arXiv Detail & Related papers (2025-07-01T06:04:28Z) - NOBLE -- Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models [68.89389652724378]
NOBLE is a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. It predicts distributions of neural dynamics accounting for the intrinsic experimental variability. NOBLE is the first scaled-up deep learning framework validated on real experimental data.
arXiv Detail & Related papers (2025-06-05T01:01:18Z) - BLEND: Behavior-guided Neural Population Dynamics Modeling via Privileged Knowledge Distillation [6.3559178227943764]
We propose BLEND, a behavior-guided neural population dynamics modeling framework via privileged knowledge distillation. By considering behavior as privileged information, we train a teacher model that takes both behavior observations (privileged features) and neural activities (regular features) as inputs. A student model is then distilled using only neural activity.
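The privileged-distillation pattern itself is compact enough to sketch: a teacher trained with both feature sets supervises a student that sees only the regular ones. The PyTorch snippet below is a generic illustration, not BLEND's architecture; the layer sizes, the MSE distillation loss, and all tensor names are assumptions.

```python
import torch
import torch.nn as nn

# Teacher consumes regular + privileged features; student gets regular only.
teacher = nn.Sequential(nn.Linear(64 + 16, 128), nn.ReLU(), nn.Linear(128, 32))
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 32))

neural = torch.randn(256, 64)    # regular features: neural activity
behavior = torch.randn(256, 16)  # privileged features: behavior observations

with torch.no_grad():  # the teacher is assumed pre-trained and frozen here
    target = teacher(torch.cat([neural, behavior], dim=1))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
loss = nn.functional.mse_loss(student(neural), target)  # distillation step
opt.zero_grad()
loss.backward()
opt.step()
```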
arXiv Detail & Related papers (2024-10-02T12:45:59Z) - Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of the constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z) - Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data [0.0]
We investigated the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on genetic datasets.
We found that outliers/skew in predictor or target variables did not pose a challenge to regression models.
We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance.
arXiv Detail & Related papers (2024-02-22T21:41:27Z) - Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data [3.46029409929709]
State-of-the-art systems neuroscience experiments yield large-scale multimodal data, and these data sets require new tools for analysis.
Inspired by the success of large pretrained models in vision and language domains, we reframe the analysis of large-scale, cellular-resolution neuronal spiking data into an autoregressive generation problem.
We first trained Neuroformer on simulated datasets and found that it both accurately predicted simulated neuronal circuit activity and intrinsically inferred the underlying neural circuit connectivity, including direction.
arXiv Detail & Related papers (2023-10-31T20:17:32Z) - ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
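A common way to combine the two ingredients, abstention and querying, is to predict only above a confidence threshold and spend the labeling budget on the least confident target samples. The sketch below illustrates that generic recipe; the threshold `tau`, the max-probability confidence score, and the function names are assumptions, not ASPEST's actual acquisition strategy.

```python
import numpy as np

def selective_predict(probs: np.ndarray, tau: float = 0.8) -> np.ndarray:
    """Predict the argmax class when confidence >= tau; otherwise abstain (-1)."""
    conf = probs.max(axis=1)
    return np.where(conf >= tau, probs.argmax(axis=1), -1)

def query_for_labels(probs: np.ndarray, budget: int) -> np.ndarray:
    """Spend the labeling budget on the least confident target samples."""
    return np.argsort(probs.max(axis=1))[:budget]

# Toy usage with random 5-class predictive distributions for 100 samples
probs = np.random.default_rng(1).dirichlet(np.ones(5), size=100)
print(selective_predict(probs)[:10])   # -1 marks abstentions
print(query_for_labels(probs, budget=8))
```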
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - Genetic Imitation Learning by Reward Extrapolation [6.340280403330784]
We propose a method called GenIL that integrates the Genetic Algorithm with imitation learning.
The involvement of the Genetic Algorithm improves the data efficiency by reproducing trajectories with various returns.
We tested GenIL in both Atari and Mujoco domains, and the results show that it outperforms previous methods.
arXiv Detail & Related papers (2023-01-03T14:12:28Z) - Active Learning for Single Neuron Models with Lipschitz Non-Linearities [35.119032992898774]
We consider the problem of active learning for single neuron models.
We show that for a single neuron model with any Lipschitz non-linearity, strong provable approximation guarantees can be obtained.
arXiv Detail & Related papers (2022-10-24T20:55:21Z) - On the Generalization and Adaption Performance of Causal Models [99.64022680811281]
Differentiable causal discovery proposes to factorize the data generating process into a set of modules.
We study the generalization and adaption performance of such modular neural causal models.
Our analysis shows that the modular neural causal models outperform other models on both zero- and few-shot adaptation in low-data regimes.
arXiv Detail & Related papers (2022-06-09T17:12:32Z) - Deep neural networks with controlled variable selection for the identification of putative causal genetic variants [0.43012765978447565]
We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies.
The merits of the proposed method include: (1) flexible modelling of the non-linear effect of genetic variants to improve statistical power; (2) multiple knockoffs in the input layer to rigorously control the false discovery rate; (3) hierarchical layers to substantially reduce the number of weight parameters and activations and improve computational efficiency.
arXiv Detail & Related papers (2021-09-29T20:57:48Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the Importance-Guided Stochastic Gradient Descent (IGSGD) method to train models that perform inference from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Variable selection with missing data in both covariates and outcomes: Imputation and machine learning [1.0333430439241666]
The missing data issue is ubiquitous in health studies.
Machine learning methods relax parametric assumptions.
XGBoost and BART have the overall best performance across various settings.
arXiv Detail & Related papers (2021-04-06T20:18:29Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Deep Low-Shot Learning for Biological Image Classification and Visualization from Limited Training Samples [52.549928980694695]
In situ hybridization (ISH) gene expression pattern images from the same developmental stage are compared. However, labeling training data with precise stages is very time-consuming even for biologists.
We propose a deep two-step low-shot learning framework to accurately classify ISH images using limited training images.
arXiv Detail & Related papers (2020-10-20T06:06:06Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, the Prototypical Network, a simple yet effective meta-learning method for few-shot image classification.
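The Prototypical Network at its core is simple to state in code: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. The snippet follows the standard ProtoNet recipe with Euclidean distances; the embedding dimension and episode layout are illustrative.

```python
import torch

def protonet_classify(support: torch.Tensor, support_y: torch.Tensor,
                      query: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Nearest-prototype classification over embedded support/query points."""
    # Prototype c = mean of support embeddings labeled c
    protos = torch.stack([support[support_y == c].mean(dim=0)
                          for c in range(n_classes)])
    return torch.cdist(query, protos).argmin(dim=1)  # closest prototype wins

# Toy 3-way episode: 5 support and 4 query embeddings per class (dim 16)
emb = torch.randn(27, 16)
support, query = emb[:15], emb[15:]
support_y = torch.arange(3).repeat_interleave(5)
print(protonet_classify(support, support_y, query, n_classes=3))
```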
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Improving Maximum Likelihood Training for Text Generation with Density
Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models that is effective and stable in the large sample spaces encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of its stochasticity in that success remains unclear.
We show that multiplicative noise commonly arises in the parameter updates due to minibatch variance, inducing heavy-tailed behaviour.
A detailed analysis of key factors, including step size and data, shows that state-of-the-art neural network models exhibit similar behaviour.
arXiv Detail & Related papers (2020-06-11T09:58:01Z) - Stochasticity in Neural ODEs: An Empirical Study [68.8204255655161]
Regularization of neural networks (e.g. dropout) is a widespread technique in deep learning that allows for better generalization.
We show that data augmentation during training improves the performance of both the deterministic and stochastic versions of the same model.
However, the improvements obtained by data augmentation completely eliminate the empirical gains from stochastic regularization, making the performance difference between neural ODEs and neural SDEs negligible.
arXiv Detail & Related papers (2020-02-22T22:12:56Z)