Handling Overlapping Asymmetric Datasets -- A Twice Penalized P-Spline
Approach
- URL: http://arxiv.org/abs/2311.10489v2
- Date: Mon, 20 Nov 2023 13:37:58 GMT
- Title: Handling Overlapping Asymmetric Datasets -- A Twice Penalized P-Spline
Approach
- Authors: Matthew McTeer, Robin Henderson, Quentin M Anstee, Paolo Missier
- Abstract summary: This research aims to develop a new method that can model the smaller cohort against a particular response while also considering the larger cohort.
We find that our twice penalized approach offers an enhanced fit over a linear B-Spline and a once penalized P-Spline approximation.
Applied to a real-life dataset on a person's risk of developing Non-Alcoholic Steatohepatitis, the method improves model fit performance by over 65%.
- Score: 0.40964539027092917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overlapping asymmetric datasets are common in data science and raise the
question of how they can be combined in a predictive analysis. In healthcare, a small
amount of information, such as an electronic health record, is often available for a
large number of patients, while only a small number of patients may have undergone
extensive further testing. Common solutions such as missing data imputation can be
unwise if the smaller cohort differs significantly in scale from the larger sample;
the aim of this research is therefore to develop a new method that can model the
smaller cohort against a particular response while also considering the larger cohort.
Motivated by non-parametric models, and specifically by flexible smoothing techniques
via generalized additive models, we develop a twice penalized P-Spline approximation
method, with a first penalty that prevents over- or under-fitting of the smaller
cohort and a second penalty that takes the larger cohort into account. The second
penalty is constructed from discrepancies in the marginal values of covariates that
exist in both the smaller and larger cohorts. Through data simulations, parameter
tuning and model adaptations for both continuous and binary responses, we find that
our twice penalized approach offers an enhanced fit over a linear B-Spline and a once
penalized P-Spline approximation. Applied to a real-life dataset on a person's risk of
developing Non-Alcoholic Steatohepatitis, the method improves model fit performance by
over 65%. Areas for future work include adapting the method so that it does not
require dimensionality reduction and extending it to parametric modelling methods. To
our knowledge, this is the first work to propose additional marginal penalties in a
flexible regression and to report a substantially improved model fit on asymmetric
datasets without the need for missing data imputation.
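As a rough illustration of the construction described in the abstract, the sketch below shows one way such a twice penalized P-spline fit could be set up in Python. The abstract does not give the exact form of either penalty, so everything here is an assumption rather than the authors' method: the basis follows the standard Eilers & Marx equally spaced B-spline construction, the first penalty is the usual difference penalty on adjacent spline coefficients, and the second penalty is taken to be a quadratic discrepancy between the smaller-cohort fit and marginal estimates supplied by the larger cohort at shared covariate values. All names (`bbase`, `twice_penalized_fit`, `x_s`, `y_s`, `x_l`, `m_l`, `lam1`, `lam2`) are hypothetical.

```python
import numpy as np
from math import factorial

def bbase(x, lo, hi, nseg=20, deg=3):
    """Equally spaced B-spline basis on [lo, hi], built from truncated power
    functions (the standard Eilers & Marx P-spline construction)."""
    dx = (hi - lo) / nseg
    knots = np.arange(lo - deg * dx, hi + deg * dx + dx / 2, dx)
    P = np.column_stack([np.where(x > t, (x - t) ** deg, 0.0) for t in knots])
    D = np.diff(np.eye(len(knots)), n=deg + 1, axis=0) / (factorial(deg) * dx ** deg)
    return (-1) ** (deg + 1) * P @ D.T

def twice_penalized_fit(x_s, y_s, x_l, m_l, lam1=1.0, lam2=1.0,
                        nseg=20, deg=3, pord=2):
    """Assumed form of a twice penalized P-spline fit on the smaller cohort:
    minimise ||y_s - B_s w||^2 + lam1 * ||D w||^2 + lam2 * ||B_l w - m_l||^2,
    where D is the usual P-spline difference penalty and (x_l, m_l) are shared
    covariate values and marginal response estimates from the larger cohort."""
    lo = min(x_s.min(), x_l.min())
    hi = max(x_s.max(), x_l.max())
    B_s = bbase(x_s, lo, hi, nseg, deg)      # basis evaluated on the smaller cohort
    B_l = bbase(x_l, lo, hi, nseg, deg)      # same basis at the larger-cohort points
    D = np.diff(np.eye(B_s.shape[1]), n=pord, axis=0)
    lhs = B_s.T @ B_s + lam1 * D.T @ D + lam2 * B_l.T @ B_l
    rhs = B_s.T @ y_s + lam2 * B_l.T @ m_l
    w = np.linalg.solve(lhs, rhs)
    return w, lambda x_new: bbase(np.asarray(x_new, dtype=float), lo, hi, nseg, deg) @ w

# Toy usage: a small fully observed cohort plus a marginal curve from a larger cohort.
rng = np.random.default_rng(1)
x_s = rng.uniform(0.0, 1.0, 60)
y_s = np.sin(2 * np.pi * x_s) + rng.normal(0.0, 0.3, 60)
x_l = np.linspace(0.0, 1.0, 200)
m_l = np.sin(2 * np.pi * x_l)                # assumed marginal estimate from larger cohort
w, f_hat = twice_penalized_fit(x_s, y_s, x_l, m_l, lam1=0.1, lam2=0.5)
print(f_hat(np.array([0.25, 0.50, 0.75])))
```

Cross-validating `lam1` and `lam2` on held-out smaller-cohort error would mirror the parameter tuning the abstract refers to; with `lam2 = 0` the fit reduces to an ordinary once penalized P-spline, which gives a direct baseline for comparison.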
Related papers
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge the sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- Linked shrinkage to improve estimation of interaction effects in regression models [0.0]
We develop an estimator that adapts well to two-way interaction terms in a regression model.
We evaluate the potential of the model for inference, which is notoriously hard for selection strategies.
Our models can be very competitive with a more advanced machine learner, such as a random forest, even for fairly large sample sizes.
arXiv Detail & Related papers (2023-09-25T10:03:39Z)
- Multi-modality fusion using canonical correlation analysis methods: Application in breast cancer survival prediction from histology and genomics [16.537929113715432]
We study the use of canonical correlation analysis (CCA) and penalized variants of CCA for the fusion of two modalities.
We analytically show that, with known model parameters, posterior mean estimators that jointly use both modalities outperform arbitrary linear mixing of single modality posterior estimators in latent variable prediction.
arXiv Detail & Related papers (2021-11-27T21:18:01Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- A Hamiltonian Monte Carlo Model for Imputation and Augmentation of Healthcare Data [0.6719751155411076]
Missing values exist in nearly all clinical studies because data for a variable or question are not collected or not available.
Existing models usually do not consider privacy concerns or do not utilise the inherent correlations across multiple features to impute the missing values.
A Bayesian approach to imputing missing values and creating augmented samples in high-dimensional healthcare data is proposed in this work.
arXiv Detail & Related papers (2021-03-03T11:57:42Z)
- Modern Multiple Imputation with Functional Data [6.624726878647541]
This work considers the problem of fitting functional models with sparsely and irregularly sampled functional data.
It overcomes the limitations of the state-of-the-art methods, which face major challenges in the fitting of more complex non-linear models.
arXiv Detail & Related papers (2020-11-25T04:22:30Z)
- Dimensionality reduction, regularization, and generalization in overparameterized regressions [8.615625517708324]
We show that PCA-OLS, also known as principal component regression, can be avoided with a dimensionality reduction.
We show that dimensionality reduction improves robustness while OLS is arbitrarily susceptible to adversarial attacks.
We find that methods in which the projection depends on the training data can outperform methods where the projections are chosen independently of the training data.
arXiv Detail & Related papers (2020-11-23T15:38:50Z)
- Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)
- An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)
- Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE).
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage; a minimal illustrative sketch of such an estimator appears after this list.
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
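The doubly robust cross-fit estimators mentioned in the last entry can be sketched generically. The block below is a minimal illustration, not the specific estimators evaluated in that paper: the propensity score and the two outcome regressions are fit on one fold, evaluated on the held-out fold, and the resulting AIPW (augmented inverse probability weighting) scores are averaged. The choice of random forests as nuisance learners, the clipping of propensity scores, and all names (`crossfit_aipw`, `a`, `y`) are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import KFold

def crossfit_aipw(X, a, y, n_splits=2, seed=0):
    """Cross-fit AIPW (doubly robust) estimate of the average causal effect
    E[Y(1) - Y(0)], with random-forest nuisance models; purely illustrative."""
    psi = np.zeros(len(y), dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        # Propensity score e(x) = P(A = 1 | X = x), clipped away from 0 and 1.
        e_model = RandomForestClassifier(n_estimators=200, random_state=seed)
        e = np.clip(e_model.fit(X[train], a[train]).predict_proba(X[test])[:, 1],
                    0.01, 0.99)
        # Outcome regressions fitted separately in the treated and control groups.
        m1 = RandomForestRegressor(n_estimators=200, random_state=seed)
        m0 = RandomForestRegressor(n_estimators=200, random_state=seed)
        m1.fit(X[train][a[train] == 1], y[train][a[train] == 1])
        m0.fit(X[train][a[train] == 0], y[train][a[train] == 0])
        mu1, mu0 = m1.predict(X[test]), m0.predict(X[test])
        at, yt = a[test], y[test]
        # AIPW score on the held-out fold.
        psi[test] = (mu1 - mu0
                     + at * (yt - mu1) / e
                     - (1 - at) * (yt - mu0) / (1 - e))
    return psi.mean()

# Toy usage on simulated data with a known average causal effect of 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
a = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))            # confounded treatment
y = 2.0 * a + X[:, 0] + X[:, 1] + rng.normal(size=2000)    # continuous outcome
print(crossfit_aipw(X, a, y))
```

With 2000 simulated observations and a true effect of 2, the printed estimate should land close to 2; increasing `n_splits` trades the size of each training fold against the variance of the fold-wise nuisance fits.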
This list is automatically generated from the titles and abstracts of the papers in this site.