Related papers: Kernel-Based Evaluation of Conditional Biological Sequence Models

URL: http://arxiv.org/abs/2510.15601v1
Date: Fri, 17 Oct 2025 12:47:51 GMT
Title: Kernel-Based Evaluation of Conditional Biological Sequence Models
Authors: Pierre Glaser, Steffanie Paul, Alissa M. Hummer, Charlotte M. Deane, Debora S. Marks, Alan N. Amin
Abstract summary: We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN.
Score: 8.322729112426819
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose a set of kernel-based tools to evaluate the designs and tune the hyperparameters of conditional sequence models, with a focus on problems in computational biology. The backbone of our tools is a new measure of discrepancy between the true conditional distribution and the model's estimate, called the Augmented Conditional Maximum Mean Discrepancy (ACMMD). Provided that the model can be sampled from, the ACMMD can be estimated unbiasedly from data to quantify absolute model fit, integrated within hypothesis tests, and used to evaluate model reliability. We demonstrate the utility of our approach by analyzing a popular protein design model, ProteinMPNN. We are able to reject the hypothesis that ProteinMPNN fits its data for various protein families, and tune the model's temperature hyperparameter to achieve a better fit.
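As a rough illustration of the kind of quantity the abstract builds on, the sketch below computes the standard unbiased (U-statistic) estimate of the squared Maximum Mean Discrepancy between two samples. It is not the paper's ACMMD: it ignores conditioning, uses a Gaussian kernel on real-valued vectors rather than a kernel on biological sequences, and the function names and the `bandwidth` parameter are hypothetical choices made for this example.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (n, d) and rows of b (m, d)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples x (n, d) and y (m, d).

    Assumption: plain two-sample MMD with a Gaussian kernel, used here
    only as a stand-in for the conditional discrepancy in the abstract.
    """
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    # Within-sample terms exclude the diagonal (i == j) to remove bias.
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    term_xy = kxy.mean()
    return term_xx + term_yy - 2.0 * term_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 2))             # stand-in for observed data
    good_model = rng.normal(size=(200, 2))       # samples from a well-fitting model
    bad_model = rng.normal(loc=2.0, size=(200, 2))  # samples from a shifted model
    print("well-fitting:", mmd2_unbiased(data, good_model))
    print("shifted:     ", mmd2_unbiased(data, bad_model))
```

Because the estimator only needs samples from the model, it mirrors the abstract's requirement that the model "can be sampled from". Values near zero suggest the two samples are hard to tell apart under the chosen kernel, while larger values indicate a mismatch.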

Related papers

Uncertainty-Guided Model Selection for Tabular Foundation Models in Biomolecule Efficacy Prediction [3.108481950101193]
In this study, we investigate an uncertainty-guided strategy for model selection. We show that a TabPFN model using straightforward sequence-based features can surpass specialized state-of-the-art predictors.
arXiv Detail & Related papers (2025-10-02T18:33:19Z)
Model Correlation Detection via Random Selection Probing [62.093777777813756]
Existing similarity-based methods require access to model parameters or produce scores without thresholds. We introduce Random Selection Probing (RSP), a hypothesis-testing framework that formulates model correlation detection as a statistical test. RSP produces rigorous p-values that quantify evidence of correlation.
arXiv Detail & Related papers (2025-09-29T01:40:26Z)
Robust Spatiotemporal Epidemic Modeling with Integrated Adaptive Outlier Detection [7.5504472850103435]
In epidemic modeling, outliers can distort parameter estimation and lead to misguided public health decisions. We introduce a robust generalized additive model (RST-GAM) to mitigate this distortion. We demonstrate the practical utility of RST-GAM by analyzing county-level COVID-19 infection data in the United States.
arXiv Detail & Related papers (2025-07-12T19:23:25Z)
Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z)
Tests for model misspecification in simulation-based inference: from local distortions to global model checks [2.0209172586699173]
We provide a solid and flexible foundation for a wide range of model discrepancy analysis tasks. We make explicit analytic connections to classical techniques: anomaly detection, model validation, and goodness-of-fit residual analysis. We show how to conduct such a distortion-driven model misspecification test for real gravitational wave data, specifically on the event GW150914.
arXiv Detail & Related papers (2024-12-19T17:48:03Z)
Adaptive Nonparametric Perturbations of Parametric Bayesian Models [33.85958872117418]
We study nonparametrically perturbed parametric (NPP) Bayesian models, in which a parametric Bayesian model is relaxed via a distortion of its likelihood. We show that NPP models can offer the robustness of nonparametric models while retaining the data efficiency of parametric models. We demonstrate our method by estimating causal effects of gene expression from single cell RNA sequencing data.
arXiv Detail & Related papers (2024-12-14T05:06:38Z)
Stable Training of Probabilistic Models Using the Leave-One-Out Maximum Log-Likelihood Objective [0.7373617024876725]
Kernel density estimation (KDE) based models are popular choices for this task, but they fail to adapt to data regions with varying densities. An adaptive KDE model is employed to circumvent this, where each kernel in the model has an individual bandwidth. A modified expectation-maximization algorithm is employed to accelerate the optimization speed reliably.
arXiv Detail & Related papers (2023-10-05T14:08:42Z)
Conditional Korhunen-Loève regression model with Basis Adaptation for high-dimensional problems: uncertainty quantification and inverse modeling [62.997667081978825]
We propose a methodology for improving the accuracy of surrogate models of the observable response of physical systems. We apply the proposed methodology to constructing surrogate models via the Basis Adaptation (BA) method of the stationary hydraulic head response.
arXiv Detail & Related papers (2023-07-05T18:14:38Z)
Generalization Metrics for Practical Quantum Advantage in Generative Models [68.8204255655161]
Generative modeling is a widely accepted natural use case for quantum computers. We construct a simple and unambiguous approach to probe practical quantum advantage for generative modeling by measuring the algorithm's generalization performance. Our simulation results show that our quantum-inspired models have up to a $68\times$ enhancement in generating unseen unique and valid samples.
arXiv Detail & Related papers (2022-01-21T16:35:35Z)
How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion. Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
Leveraging Global Parameters for Flow-based Neural Posterior Estimation [90.21090932619695]
Inferring the parameters of a model based on experimental observations is central to the scientific method. A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations. We present a method for cracking such indeterminacy by exploiting additional information conveyed by an auxiliary set of observations sharing global parameters.
arXiv Detail & Related papers (2021-02-12T12:23:13Z)
Autoregressive Score Matching [113.4502004812927]
We propose autoregressive conditional score models (AR-CSM) where we parameterize the joint distribution in terms of the derivatives of univariable log-conditionals (scores). For AR-CSM models, this divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training. We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.
arXiv Detail & Related papers (2020-10-24T07:01:24Z)
Unravelling the Architecture of Membrane Proteins with Conditional Random Fields [11.321552104966326]
We will show that Conditional Random Fields (CRFs) provide a template to integrate micro-level information about biological entities into a mathematical model to understand their macro-level behavior. A comparison on benchmark data sets against twenty-eight other methods shows that the CRF model leads to extremely accurate predictions.
arXiv Detail & Related papers (2020-08-06T05:57:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.