Extended Missing Data Imputation via GANs for Ranking Applications
- URL: http://arxiv.org/abs/2011.02089v3
- Date: Wed, 10 Nov 2021 16:44:08 GMT
- Title: Extended Missing Data Imputation via GANs for Ranking Applications
- Authors: Grace Deng, Cuize Han, David S. Matteson
- Abstract summary: Conditional Imputation GAN is an extended missing data imputation method based on Generative Adversarial Networks (GANs).
We prove that the optimal GAN imputation is achieved for Extended Missing At Random (EMAR) and Extended Always Missing At Random (EAMAR) mechanisms, beyond the naive MCAR.
- Score: 5.2710726359379265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Conditional Imputation GAN, an extended missing data imputation
method based on Generative Adversarial Networks (GANs). The motivating use case
is learning-to-rank, the cornerstone of modern search, recommendation,
and information retrieval applications. Empirical ranking datasets do not
always follow standard Gaussian distributions or Missing Completely At Random
(MCAR) mechanism, which are standard assumptions of classic missing data
imputation methods. Our methodology provides a simple solution that offers
compatible imputation guarantees while relaxing assumptions for missing
mechanisms and sidesteps approximating intractable distributions to improve
imputation quality. We prove that the optimal GAN imputation is achieved for
Extended Missing At Random (EMAR) and Extended Always Missing At Random (EAMAR)
mechanisms, beyond the naive MCAR. Our method demonstrates the highest
imputation quality on the open-source Microsoft Research Ranking (MSR) Dataset
and a synthetic ranking dataset compared to state-of-the-art benchmarks and
across various feature distributions. Using a proprietary Amazon Search ranking
dataset, we also demonstrate comparable ranking quality metrics for ranking
models trained on GAN-imputed data compared to ground-truth data.
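The core GAN-imputation step described above — keep the observed entries and fill the missing ones from a generator — can be sketched as follows. This is an illustrative simplification, not the authors' implementation; `x_gen` is assumed to be the output of an already-trained conditional generator.

```python
import numpy as np

def combine_imputation(x_obs, mask, x_gen):
    """GAIN-style imputation step (illustrative sketch only).

    x_obs : data matrix with missing positions zero-filled
    mask  : 1.0 where observed, 0.0 where missing
    x_gen : generator output of the same shape

    Observed entries pass through unchanged; missing entries are
    taken from the generator.
    """
    return mask * x_obs + (1.0 - mask) * x_gen

# Example: one missing entry at position (0, 1)
x_obs = np.array([[1.0, 0.0], [3.0, 4.0]])
mask = np.array([[1.0, 0.0], [1.0, 1.0]])
x_gen = np.array([[9.0, 9.0], [9.0, 9.0]])
imputed = combine_imputation(x_obs, mask, x_gen)
```

In a full training loop, a discriminator would try to predict `mask` from `imputed`, pushing the generator's fills toward the conditional data distribution.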
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations.
MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework.
We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
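The ATC recipe can be sketched in a few lines. This is a simplified illustration, not the paper's reference code; it assumes source confidences, source correctness labels, and target confidences are available as arrays, and picks the threshold so that the fraction of source examples above it matches source accuracy.

```python
import numpy as np

def atc_threshold(src_conf, src_correct):
    """Choose a confidence threshold t so that the fraction of source
    examples with confidence above t matches the source accuracy."""
    acc = src_correct.mean()
    return np.quantile(src_conf, 1.0 - acc)

def atc_predict(tgt_conf, t):
    """Predicted target accuracy: fraction of target examples whose
    confidence exceeds the learned threshold."""
    return float((tgt_conf > t).mean())

# Example with synthetic confidences; source accuracy is 0.8
src_conf = np.linspace(0.1, 1.0, 10)
src_correct = np.array([0, 0, 1, 1, 1, 1, 1, 1, 1, 1])
t = atc_threshold(src_conf, src_correct)
```

By construction, applying the threshold back to the source confidences recovers the source accuracy; the interesting use is applying it to unlabeled target confidences.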
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems [6.123324869194195]
We propose Multiple Imputation via Generative Adversarial Network (MI-GAN), a deep learning-based (specifically, GAN-based) multiple imputation method.
MI-GAN shows strong performance matching existing state-of-the-art imputation methods on high-dimensional datasets.
In particular, MI-GAN significantly outperforms other imputation methods in the sense of statistical inference and computational speed.
arXiv Detail & Related papers (2021-12-21T20:19:37Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies outperform classification based on observed data alone and maintain high accuracy even as the missing data ratio increases.
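The first strategy — imputing with $k$-nearest neighbors before estimating the covariance — can be sketched as follows. This is a minimal illustration, not the paper's implementation: distances are computed over jointly observed coordinates, and each missing entry is filled with the mean of its feature over the $k$ nearest rows.

```python
import numpy as np

def knn_impute(X, k=2):
    """Minimal k-NN imputation sketch (illustrative only): fill each
    missing entry with the mean of that feature over the k rows that
    are nearest in the coordinates observed in both rows."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    for i in np.where(miss.any(axis=1))[0]:
        d = np.full(len(X), np.inf)
        for j in range(len(X)):
            both = ~miss[i] & ~miss[j]  # jointly observed coordinates
            if j != i and both.any():
                d[j] = np.linalg.norm(X[i, both] - X[j, both])
        nn = np.argsort(d)[:k]  # indices of the k nearest rows
        for c in np.where(miss[i])[0]:
            vals = X[nn, c][~np.isnan(X[nn, c])]
            if vals.size:
                X[i, c] = vals.mean()
    return X

# Example: row 2 is missing its second feature
X = np.array([[1.0, 2.0], [1.0, 2.1], [1.0, np.nan], [10.0, 20.0]])
X_filled = knn_impute(X, k=2)
```

A covariance estimate such as `np.cov(X_filled.T)` can then be computed on the completed matrix, as in the paper's first strategy.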
arXiv Detail & Related papers (2021-10-19T14:24:50Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Imputation of Missing Data with Class Imbalance using Conditional Generative Adversarial Networks [24.075691766743702]
We propose a new method for imputing missing data based on its class-specific characteristics.
Our Conditional Generative Adversarial Imputation Network (CGAIN) imputes the missing data using class-specific distributions.
We tested our approach on benchmark datasets and achieved superior performance compared with the state-of-the-art and popular imputation approaches.
arXiv Detail & Related papers (2020-12-01T02:26:54Z)
- PC-GAIN: Pseudo-label Conditional Generative Adversarial Imputation Networks for Incomplete Data [19.952411963344556]
PC-GAIN is a novel unsupervised missing data imputation method.
We first propose a pre-training procedure to learn potential category information contained in a subset of low-missing-rate data.
Then an auxiliary classifier is determined using the synthetic pseudo-labels.
arXiv Detail & Related papers (2020-11-16T08:08:26Z)
- Missing Data Imputation using Optimal Transport [43.14084843713895]
We leverage optimal transport distances to quantify a criterion and turn it into a loss function to impute missing data values.
We propose practical methods to minimize these losses using end-to-end learning.
Experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even for high percentages of missing values.
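The OT criterion above — an optimal transport distance between batches, used as a differentiable imputation loss — can be sketched with a basic entropy-regularized (Sinkhorn) cost. This is an illustrative numpy version, not the paper's implementation; in the paper the loss is minimized end-to-end over the imputed values.

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.01, n_iter=100):
    """Entropy-regularized OT cost between two point batches
    (illustrative sketch of the loss used for imputation).

    x, y : arrays of shape (n, d) and (m, d)
    eps  : entropic regularization strength
    """
    # Pairwise squared-Euclidean cost matrix
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-C / eps)
    a = np.full(len(x), 1.0 / len(x))  # uniform source weights
    b = np.full(len(y), 1.0 / len(y))  # uniform target weights
    u = np.ones_like(a)
    for _ in range(n_iter):           # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return float((P * C).sum())

# Identical batches have (near-)zero cost; distant batches do not
x = np.array([[0.0], [1.0]])
d_same = sinkhorn_cost(x, x)
d_far = sinkhorn_cost(np.array([[0.0]]), np.array([[1.0]]))
```

Used as a loss, the missing entries are treated as trainable parameters and updated by gradient descent so that random batches of the completed data are close in this OT sense.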
arXiv Detail & Related papers (2020-02-10T15:23:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.