Dealing with missing data using attention and latent space
regularization
- URL: http://arxiv.org/abs/2211.07059v1
- Date: Mon, 14 Nov 2022 01:05:28 GMT
- Title: Dealing with missing data using attention and latent space
regularization
- Authors: Jahan C. Penny-Dimri, Christoph Bergmeir, Julian Smith
- Abstract summary: We develop a theoretical framework for training and inference using only observed variables.
We construct models with latent space representations that regularize against the potential bias introduced by missing data.
We show that our proposed method overcomes the weaknesses of imputation methods and outperforms the current state-of-the-art.
- Score: 2.610470075814367
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most practical data science problems encounter missing data. A wide variety
of solutions exist, each with strengths and weaknesses that depend upon the
missingness-generating process. Here we develop a theoretical framework for
training and inference using only observed variables, enabling modeling of
incomplete datasets without imputation. Using an information- and
measure-theoretic argument, we construct models with latent space
representations that regularize against the potential bias introduced by
missing data. The theoretical properties of this approach are demonstrated
empirically using a synthetic dataset. The performance of this approach is
tested on 11 benchmarking datasets with missingness and 18 datasets corrupted
across three missingness patterns with comparison against a state-of-the-art
model and industry-standard imputation. We show that our proposed method
overcomes the weaknesses of imputation methods and outperforms the current
state-of-the-art.
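The abstract does not spell out the architecture, so the following is only a minimal sketch of the general idea of computing a latent representation from observed variables alone, with no imputation step. It is not the authors' model. The function name `masked_attention_pool` and the parameters `W_embed`, `b_embed`, and `query` are hypothetical, and a single-query attention pooling in NumPy is assumed for brevity.

```python
import numpy as np

def masked_attention_pool(x, observed, W_embed, b_embed, query):
    """Pool per-feature embeddings with attention restricted to observed features.

    x:        (n_features,) values; entries at unobserved positions are ignored
    observed: (n_features,) boolean mask, True where the value is present
    W_embed:  (n_features, d) per-feature value weights (hypothetical parameters)
    b_embed:  (n_features, d) per-feature identity embeddings (hypothetical)
    query:    (d,) attention query vector (hypothetical)
    """
    if not observed.any():
        raise ValueError("at least one feature must be observed")
    # Replace missing entries with 0 only so the arithmetic stays finite;
    # their attention weight is forced to exactly zero below.
    x_safe = np.where(observed, x, 0.0)
    tokens = x_safe[:, None] * W_embed + b_embed          # (n_features, d)
    scores = tokens @ query / np.sqrt(tokens.shape[1])    # (n_features,)
    scores = np.where(observed, scores, -np.inf)          # mask out missing features
    scores = scores - scores[observed].max()              # numerical stability
    weights = np.exp(scores)                              # exp(-inf) evaluates to 0
    weights = weights / weights.sum()
    latent = weights @ tokens                             # (d,)
    return latent, weights
```

Because missing positions receive zero attention weight, the returned latent vector does not depend on whatever placeholder sits in a missing slot, which is the property that makes a separate imputation step unnecessary in this sketch.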
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- Optimal Transport for Structure Learning Under Missing Data [31.240965564055138]
We propose a score-based algorithm for learning causal structures from missing data based on optimal transport.
Our framework is shown to recover the true causal structure more effectively than competing methods in most simulations and real-data settings.
arXiv Detail & Related papers (2024-02-23T10:49:04Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- MissDiff: Training Diffusion Models on Tabular Data with Missing Values [29.894691645801597]
This work presents a unified and principled diffusion-based framework for learning from data with missing values.
We first observe that the widely adopted "impute-then-generate" pipeline may lead to a biased learning objective.
We prove the proposed method is consistent in learning the score of data distributions, and the proposed training objective serves as an upper bound for the negative likelihood in certain cases.
arXiv Detail & Related papers (2023-07-02T03:49:47Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Efficient Multidimensional Functional Data Analysis Using Marginal Product Basis Systems [2.4554686192257424]
We propose a framework for learning continuous representations from a sample of multidimensional functional data.
We show that the resulting estimation problem can be solved efficiently by the tensor decomposition.
We conclude with a real data application in neuroimaging.
arXiv Detail & Related papers (2021-07-30T16:02:15Z)
- OR-Net: Pointwise Relational Inference for Data Completion under Partial Observation [51.083573770706636]
This work uses relational inference to fill in the incomplete data.
We propose Omni-Relational Network (OR-Net) to model the pointwise relativity in two aspects.
arXiv Detail & Related papers (2021-05-02T06:05:54Z)
- MAIN: Multihead-Attention Imputation Networks [4.427447378048202]
We propose a novel mechanism based on multi-head attention which can be applied effortlessly in any model.
Our method inductively models patterns of missingness in the input data in order to increase the performance of the downstream task.
arXiv Detail & Related papers (2021-02-10T13:50:02Z)
- Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets.
Part of the challenge of learning robust models lies in the influence of unobserved confounders.
We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Multiple Imputation with Denoising Autoencoder using Metamorphic Truth and Imputation Feedback [0.0]
We propose a Multiple Imputation model using Denoising Autoencoders to learn the internal representation of data.
We use the novel mechanisms of Metamorphic Truth and Imputation Feedback to maintain statistical integrity of attributes.
Our approach explores the effects of imputation on various missingness mechanisms and patterns of missing data, outperforming other methods in many standard test cases.
arXiv Detail & Related papers (2020-02-19T18:26:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.