DPERC: Direct Parameter Estimation for Mixed Data
- URL: http://arxiv.org/abs/2501.10540v1
- Date: Fri, 17 Jan 2025 20:24:40 GMT
- Title: DPERC: Direct Parameter Estimation for Mixed Data
- Authors: Tuan L. Vo, Quan Huu Do, Uyen Dang, Thu Nguyen, Pål Halvorsen, Michael A. Riegler, Binh T. Nguyen,
- Abstract summary: We propose Direct Estimation for Randomly Missing Data with Categorical Features (DPERC)
DPERC is an efficient approach for direct parameter estimation tailored to mixed data that contains missing values within continuous features.
Our approach effectively harnesses the information embedded within mixed data structures.
- Score: 5.319419944378011
- License:
- Abstract: The covariance matrix is a foundation in numerous statistical and machine-learning applications such as Principle Component Analysis, Correlation Heatmap, etc. However, missing values within datasets present a formidable obstacle to accurately estimating this matrix. While imputation methods offer one avenue for addressing this challenge, they often entail a trade-off between computational efficiency and estimation accuracy. Consequently, attention has shifted towards direct parameter estimation, given its precision and reduced computational burden. In this paper, we propose Direct Parameter Estimation for Randomly Missing Data with Categorical Features (DPERC), an efficient approach for direct parameter estimation tailored to mixed data that contains missing values within continuous features. Our method is motivated by leveraging information from categorical features, which can significantly enhance covariance matrix estimation for continuous features. Our approach effectively harnesses the information embedded within mixed data structures. Through comprehensive evaluations of diverse datasets, we demonstrate the competitive performance of DPERC compared to various contemporary techniques. In addition, we also show by experiments that DPERC is a valuable tool for visualizing the correlation heatmap.
Related papers
- ARD-VAE: A Statistical Formulation to Find the Relevant Latent Dimensions of Variational Autoencoders [0.5759862457142761]
We propose a statistical formulation to discover the relevant latent factors required for modeling a dataset.
We call the proposed method the automatic relevancy detection in the variational autoencoder (ARD-VAE)
arXiv Detail & Related papers (2025-01-18T23:27:05Z) - Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z) - Instance-Specific Asymmetric Sensitivity in Differential Privacy [2.855485723554975]
We build upon previous work that gives a paradigm for selecting an output through the exponential mechanism.
Our framework will slightly modify the closeness metric and instead give a simple and efficient application of the sparse vector technique.
arXiv Detail & Related papers (2023-11-02T05:01:45Z) - Correlation visualization under missing values: a comparison between
imputation and direct parameter estimation methods [4.963490281438653]
We compare the effects of various missing data methods on the correlation plot, focusing on two common missing patterns: random and monotone.
We recommend using DPER, a direct parameter estimation approach, for plotting the correlation matrix based on its performance in the experiments.
arXiv Detail & Related papers (2023-05-10T10:52:30Z) - Exogenous Data in Forecasting: FARM -- A New Measure for Relevance
Evaluation [62.997667081978825]
We introduce a new approach named FARM - Forward Relevance Aligned Metric.
Our forward method relies on an angular measure that compares changes in subsequent data points to align time-warped series.
As a first validation step, we present the application of our FARM approach to synthetic but representative signals.
arXiv Detail & Related papers (2023-04-21T15:22:33Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Meta Learning Low Rank Covariance Factors for Energy-Based Deterministic
Uncertainty [58.144520501201995]
Bi-Lipschitz regularization of neural network layers preserve relative distances between data instances in the feature spaces of each layer.
With the use of an attentive set encoder, we propose to meta learn either diagonal or diagonal plus low-rank factors to efficiently construct task specific covariance matrices.
We also propose an inference procedure which utilizes scaled energy to achieve a final predictive distribution.
arXiv Detail & Related papers (2021-10-12T22:04:19Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - FCMI: Feature Correlation based Missing Data Imputation [0.0]
We propose an efficient technique to impute the missing value in the dataset based on correlation called FCMI.
Our proposed algorithm picks the highly correlated attributes of the dataset and uses these attributes to build a regression model.
Experiments conducted on both classification and regression datasets show that the proposed imputation technique outperforms existing imputation algorithms.
arXiv Detail & Related papers (2021-06-26T13:35:33Z) - Doubly Robust Semiparametric Difference-in-Differences Estimators with
High-Dimensional Data [15.27393561231633]
We propose a doubly robust two-stage semiparametric difference-in-difference estimator for estimating heterogeneous treatment effects.
The first stage allows a general set of machine learning methods to be used to estimate the propensity score.
In the second stage, we derive the rates of convergence for both the parametric parameter and the unknown function.
arXiv Detail & Related papers (2020-09-07T15:14:29Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.