On Procrustes Contamination in Machine Learning Applications of Geometric Morphometrics
- URL: http://arxiv.org/abs/2601.18448v1
- Date: Mon, 26 Jan 2026 12:56:23 GMT
- Title: On Procrustes Contamination in Machine Learning Applications of Geometric Morphometrics
- Authors: Lloyd Austin Courtenay
- Abstract summary: Geometric morphometrics (GMM) is widely used to quantify shape variation, more recently serving as input for machine learning (ML) analyses. Standard practice aligns all specimens via Generalized Procrustes Analysis (GPA) prior to splitting data into training and test sets. Here, the effects of GPA-induced contamination are formally characterised using controlled 2D and 3D simulations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Geometric morphometrics (GMM) is widely used to quantify shape variation, more recently serving as input for machine learning (ML) analyses. Standard practice aligns all specimens via Generalized Procrustes Analysis (GPA) prior to splitting data into training and test sets, potentially introducing statistical dependence and contaminating downstream predictive models. Here, the effects of GPA-induced contamination are formally characterised using controlled 2D and 3D simulations across varying sample sizes, landmark densities, and allometric patterns. A novel realignment procedure is proposed, whereby test specimens are aligned to the training set prior to model fitting, eliminating cross-sample dependency. Simulations reveal a robust "diagonal" in sample-size vs. landmark-space, reflecting the scaling of RMSE under isotropic variation, with slopes analytically derived from the degrees of freedom in Procrustes tangent space. The importance of spatial autocorrelation among landmarks is further demonstrated using linear and convolutional regression models, highlighting performance degradation when landmark relationships are ignored. This work establishes the need for careful preprocessing in ML applications of GMM, provides practical guidelines for realignment, and clarifies fundamental statistical constraints inherent to Procrustes shape space.
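The realignment idea in the abstract (fit the Procrustes alignment on the training specimens only, then superimpose each test specimen onto the training consensus) can be sketched as follows. This is a minimal illustration with NumPy, not the author's implementation: the function names `center_scale`, `rotate_onto`, `gpa`, and `align_to_reference` are hypothetical, the data are synthetic, and the sketch assumes partial Procrustes superimposition (unit centroid size) without tangent-space projection.

```python
import numpy as np


def center_scale(X):
    """Remove translation and scale: centre at the origin, unit centroid size."""
    Xc = X - X.mean(axis=0)
    return Xc / np.linalg.norm(Xc)


def rotate_onto(X, ref):
    """Rotate a centred, scaled configuration X (k x d) onto ref via SVD."""
    U, _, Vt = np.linalg.svd(X.T @ ref)
    # Flip the last singular direction if needed so the result is a
    # proper rotation (determinant +1) rather than a reflection.
    D = np.eye(X.shape[1])
    D[-1, -1] = np.sign(np.linalg.det(U @ Vt))
    return X @ (U @ D @ Vt)


def gpa(shapes, max_iter=50, tol=1e-10):
    """Iterative Generalized Procrustes Analysis on an (n, k, d) array.

    Returns the aligned specimens and the consensus (mean) shape.
    """
    aligned = np.array([center_scale(s) for s in shapes])
    mean = aligned[0]
    for _ in range(max_iter):
        aligned = np.array([rotate_onto(s, mean) for s in aligned])
        new_mean = center_scale(aligned.mean(axis=0))
        converged = np.linalg.norm(new_mean - mean) < tol
        mean = new_mean
        if converged:
            break
    return aligned, mean


def align_to_reference(shapes, ref):
    """Superimpose each specimen onto a fixed reference, one at a time."""
    return np.array([rotate_onto(center_scale(s), ref) for s in shapes])


# Synthetic example (assumption): 30 specimens, 8 landmarks, 2D.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 2))
shapes = base + 0.05 * rng.normal(size=(30, 8, 2))

# Split BEFORE alignment, then fit GPA on the training set only.
train, test = shapes[:20], shapes[20:]
train_aligned, train_mean = gpa(train)

# Test specimens never influence the consensus; each is aligned independently.
test_aligned = align_to_reference(test, train_mean)
```

Because each test specimen is aligned to a consensus computed from training data alone, its aligned coordinates do not depend on any other test specimen, which is the property the realignment procedure is meant to guarantee.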
Related papers
- Simulation-Based Inference via Regression Projection and Batched Discrepancies [1.9435397960631862]
We analyze a lightweight simulation-based inference method that infers simulator parameters using only a regression-based projection of the observed data. Experiments on a tractable nonlinear model and on a cosmological calibration task using the DREAMS simulation suite illustrate the computational advantages of regression-based projections.
arXiv Detail & Related papers (2026-02-03T15:07:40Z) - Prediction of Fault Slip Tendency in CO${_2}$ Storage using Data-space Inversion [0.0]
We implement a variational autoencoder (VAE)-based data-space inversion (DSI) framework to predict pressure, stress and strain fields, and fault slip tendency. The DSI-VAE framework is shown to give accurate predictions for pressure, strain, and stress fields and for fault slip tendency.
arXiv Detail & Related papers (2026-01-08T23:41:04Z) - SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for model collapse. By utilizing benchmarks that deriving and deterministic bounds on the matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards states, offering both theoretical insights into the mechanics of collapse.
arXiv Detail & Related papers (2026-01-06T19:47:11Z) - On metric choice in dimension reduction for Fréchet regression [7.161207910629032]
Fréchet regression is becoming a mainstay in modern data analysis for analyzing non-traditional data types.
It is especially useful in the analysis of complex health data such as continuous monitoring and imaging data.
arXiv Detail & Related papers (2024-10-02T17:39:34Z) - Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference (SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation.
In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model.
We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z) - Perturbative partial moment matching and gradient-flow adaptive importance sampling transformations for Bayesian leave one out cross-validation [0.9895793818721335]
We motivate the use of perturbative transformations of the form $T(\boldsymbol{\theta})=\boldsymbol{\theta} + h Q(\boldsymbol{\theta})$, for $0<h\ll 1$. We derive closed-form expressions in the case of logistic regression and shallow ReLU activated neural networks.
arXiv Detail & Related papers (2024-02-13T01:03:39Z) - Temporal-spatial model via Trend Filtering [12.875863572064986]
This research focuses on the estimation of a non-parametric regression function designed for data with simultaneous time and space dependencies.
A unique phase transition phenomenon, previously uncharted in Trend Filtering studies, emerges through our analysis.
arXiv Detail & Related papers (2023-08-30T17:50:00Z) - Large-scale gradient-based training of Mixtures of Factor Analyzers [67.21722742907981]
This article contributes both a theoretical analysis as well as a new method for efficient high-dimensional training by gradient descent.
We prove that MFA training and inference/sampling can be performed based on precision matrices, which does not require matrix inversions after training is completed.
Besides the theoretical analysis and matrices, we apply MFA to typical image datasets such as SVHN and MNIST, and demonstrate the ability to perform sample generation and outlier detection.
arXiv Detail & Related papers (2023-08-26T06:12:33Z) - Conditional Korhunen-Loéve regression model with Basis Adaptation for high-dimensional problems: uncertainty quantification and inverse modeling [62.997667081978825]
We propose a methodology for improving the accuracy of surrogate models of the observable response of physical systems.
We apply the proposed methodology to constructing surrogate models via the Basis Adaptation (BA) method of the stationary hydraulic head response.
arXiv Detail & Related papers (2023-07-05T18:14:38Z) - Data-driven reduced-order modelling for blood flow simulations with geometry-informed snapshots [0.0]
A data-driven surrogate model is proposed for the efficient prediction of blood flow simulations on similar but distinct domains.
A non-intrusive reduced-order model for geometrical parameters is constructed using proper orthogonal decomposition.
A radial basis function interpolator is trained for predicting the reduced coefficients of the reduced-order model.
arXiv Detail & Related papers (2023-02-21T21:18:17Z) - Mixed Effects Neural ODE: A Variational Approximation for Analyzing the Dynamics of Panel Data [50.23363975709122]
We propose a probabilistic model called ME-NODE to incorporate (fixed + random) mixed effects for analyzing panel data.
We show that our model can be derived using smooth approximations of SDEs provided by the Wong-Zakai theorem.
We then derive Evidence Based Lower Bounds for ME-NODE, and develop (efficient) training algorithms.
arXiv Detail & Related papers (2022-02-18T22:41:51Z) - Inverse Learning of Symmetries [71.62109774068064]
We learn the symmetry transformation with a model consisting of two latent subspaces.
Our approach is based on the deep information bottleneck in combination with a continuous mutual information regulariser.
Our model outperforms state-of-the-art methods on artificial and molecular datasets.
arXiv Detail & Related papers (2020-02-07T13:48:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.