Extended Missing Data Imputation via GANs for Ranking Applications
- URL: http://arxiv.org/abs/2011.02089v3
- Date: Wed, 10 Nov 2021 16:44:08 GMT
- Title: Extended Missing Data Imputation via GANs for Ranking Applications
- Authors: Grace Deng, Cuize Han, David S. Matteson
- Abstract summary: Conditional Imputation GAN is an extended missing data imputation method based on Generative Adversarial Networks (GANs).
We prove that the optimal GAN imputation is achieved for Extended Missing At Random (EMAR) and Extended Always Missing At Random (EAMAR) mechanisms, beyond the naive MCAR.
- Score: 5.2710726359379265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Conditional Imputation GAN, an extended missing data imputation
method based on Generative Adversarial Networks (GANs). The motivating use case
is learning-to-rank, the cornerstone of modern search, recommendation,
and information retrieval applications. Empirical ranking datasets do not
always follow standard Gaussian distributions or Missing Completely At Random
(MCAR) mechanism, which are standard assumptions of classic missing data
imputation methods. Our methodology provides a simple solution that offers
compatible imputation guarantees while relaxing assumptions for missing
mechanisms and sidesteps approximating intractable distributions to improve
imputation quality. We prove that the optimal GAN imputation is achieved for
Extended Missing At Random (EMAR) and Extended Always Missing At Random (EAMAR)
mechanisms, beyond the naive MCAR. Our method demonstrates the highest
imputation quality on the open-source Microsoft Research Ranking (MSR) Dataset
and a synthetic ranking dataset compared to state-of-the-art benchmarks and
across various feature distributions. Using a proprietary Amazon Search ranking
dataset, we also demonstrate comparable ranking quality metrics for ranking
models trained on GAN-imputed data compared to ground-truth data.
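The core GAN-imputation step described above — keep the observed entries and fill the missing ones from a generator — can be sketched as follows. This is an illustrative simplification, not the authors' implementation; `x_gen` is assumed to be the output of an already-trained conditional generator.

```python
import numpy as np

def combine_imputation(x_obs, mask, x_gen):
    """GAIN-style imputation step (illustrative sketch only).

    x_obs : data matrix with missing positions zero-filled
    mask  : 1.0 where observed, 0.0 where missing
    x_gen : generator output of the same shape

    Observed entries pass through unchanged; missing entries are
    taken from the generator.
    """
    return mask * x_obs + (1.0 - mask) * x_gen

# Example: one missing entry at position (0, 1)
x_obs = np.array([[1.0, 0.0], [3.0, 4.0]])
mask = np.array([[1.0, 0.0], [1.0, 1.0]])
x_gen = np.array([[9.0, 9.0], [9.0, 9.0]])
imputed = combine_imputation(x_obs, mask, x_gen)
```

In a full training loop, a discriminator would try to predict `mask` from `imputed`, pushing the generator's fills toward the conditional data distribution.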
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations.
MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework.
We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
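The ATC recipe can be sketched in a few lines. This is a simplified illustration, not the paper's reference code; it assumes source confidences, source correctness labels, and target confidences are available as arrays, and picks the threshold so that the fraction of source examples above it matches source accuracy.

```python
import numpy as np

def atc_threshold(src_conf, src_correct):
    """Choose a confidence threshold t so that the fraction of source
    examples with confidence above t matches the source accuracy."""
    acc = src_correct.mean()
    return np.quantile(src_conf, 1.0 - acc)

def atc_predict(tgt_conf, t):
    """Predicted target accuracy: fraction of target examples whose
    confidence exceeds the learned threshold."""
    return float((tgt_conf > t).mean())

# Example with synthetic confidences; source accuracy is 0.8
src_conf = np.linspace(0.1, 1.0, 10)
src_correct = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
t = atc_threshold(src_conf, src_correct)
```

By construction, applying the threshold back to the source confidences recovers the source accuracy; the interesting use is applying it to unlabeled target confidences.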
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems [6.123324869194195]
We propose Multiple Imputation via Generative Adversarial Network (MI-GAN), a deep learning-based (specifically, GAN-based) multiple imputation method.
MI-GAN shows strong performance matching existing state-of-the-art imputation methods on high-dimensional datasets.
In particular, MI-GAN significantly outperforms other imputation methods in the sense of statistical inference and computational speed.
arXiv Detail & Related papers (2021-12-21T20:19:37Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies outperform classification based on observed data alone and maintain high accuracy even as the missing data ratio increases.
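The first strategy — imputing with $k$-nearest neighbors before estimating the covariance — can be sketched as follows. This is a minimal illustration, not the paper's implementation: distances are computed over jointly observed coordinates, and each missing entry is filled with the mean of its feature over the $k$ nearest rows.

```python
import numpy as np

def knn_impute(X, k=2):
    """Minimal k-NN imputation sketch (illustrative only): fill each
    missing entry with the mean of that feature over the k rows that
    are nearest in the coordinates observed in both rows."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        d = np.full(len(X), np.inf)
        for j in range(len(X)):
            both = ~miss[i] & ~miss[j]  # jointly observed coordinates
            if j != i and both.any():
                d[j] = np.linalg.norm(X[i, both] - X[j, both])
        nn = np.argsort(d)[:k]  # indices of the k nearest rows
        for c in np.where(miss[i])[0]:
            vals = X[nn, c][~np.isnan(X[nn, c])]
            if vals.size:
                X[i, c] = vals.mean()
    return X

# Example: row 2 is missing its second feature
X = np.array([[1.0, 2.0], [1.0, 2.1], [1.0, np.nan], [10.0, 20.0]])
X_filled = knn_impute(X, k=2)
```

A covariance estimate such as `np.cov(X_filled.T)` can then be computed on the completed matrix, as in the paper's first strategy.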
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Imputation of Missing Data with Class Imbalance using Conditional Generative Adversarial Networks [24.075691766743702]
We propose a new method for imputing missing data based on its class-specific characteristics.
Our Conditional Generative Adversarial Imputation Network (CGAIN) imputes the missing data using class-specific distributions.
We tested our approach on benchmark datasets and achieved superior performance compared with the state-of-the-art and popular imputation approaches.
arXiv Detail & Related papers (2020-12-01T02:26:54Z)
- PC-GAIN: Pseudo-label Conditional Generative Adversarial Imputation Networks for Incomplete Data [19.952411963344556]
PC-GAIN is a novel unsupervised missing data imputation method.
We first propose a pre-training procedure to learn potential category information contained in a subset of low-missing-rate data.
Then an auxiliary classifier is determined using the synthetic pseudo-labels.
arXiv Detail & Related papers (2020-11-16T08:08:26Z)
- Missing Data Imputation using Optimal Transport [43.14084843713895]
We leverage optimal transport distances to quantify a criterion and turn it into a loss function to impute missing data values.
We propose practical methods to minimize these losses using end-to-end learning.
Experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
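The OT criterion above — an optimal transport distance between batches, used as a differentiable imputation loss — can be sketched with a basic entropy-regularized (Sinkhorn) cost. This is an illustrative numpy version, not the paper's implementation; in the paper the loss is minimized end-to-end over the imputed values.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.01, n_iter=100):
    """Entropy-regularized OT cost between two point batches
    (illustrative sketch of the loss used for imputation).

    x, y : arrays of shape (n, d) and (m, d)
    eps  : entropic regularization strength
    """
    # Pairwise squared-Euclidean cost matrix
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / eps)
    a = np.full(len(x), 1.0 / len(x))  # uniform source weights
    b = np.full(len(y), 1.0 / len(y))  # uniform target weights
    u = np.ones_like(a)
    for _ in range(n_iter):           # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return float((P * C).sum())

# Identical batches have (near-)zero cost; distant batches do not
x = np.array([[0.0], [1.0]])
d_same = sinkhorn_cost(x, x)
d_far = sinkhorn_cost(np.array([[0.0]]), np.array([[1.0]]))
```

Used as a loss, the missing entries are treated as trainable parameters and updated by gradient descent so that random batches of the completed data are close in this OT sense.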
arXiv Detail & Related papers (2020-02-10T15:23:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.