Variable Selection in Maximum Mean Discrepancy for Interpretable
Distribution Comparison
- URL: http://arxiv.org/abs/2311.01537v1
- Date: Thu, 2 Nov 2023 18:38:39 GMT
- Title: Variable Selection in Maximum Mean Discrepancy for Interpretable
Distribution Comparison
- Authors: Kensuke Mitsuzawa, Motonobu Kanagawa, Stefano Bortoli, Margherita
Grossi and Paolo Papotti
- Abstract summary: Two-sample testing decides whether two datasets are generated from the same distribution.
This paper studies variable selection for two-sample testing, the task being to identify the variables responsible for the discrepancies between the two distributions.
- Score: 9.12501922682336
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Two-sample testing decides whether two datasets are generated from the same
distribution. This paper studies variable selection for two-sample testing, the
task being to identify the variables (or dimensions) responsible for the
discrepancies between the two distributions. This task is relevant to many
problems of pattern analysis and machine learning, such as dataset shift
adaptation, causal inference and model validation. Our approach builds on a
two-sample test based on the Maximum Mean Discrepancy (MMD). We optimise the
Automatic Relevance Detection (ARD) weights defined for individual variables to
maximise the power of the MMD-based test. For this optimisation, we introduce
sparse regularisation and propose two methods for dealing with the issue of
selecting an appropriate regularisation parameter. One method determines the
regularisation parameter in a data-driven way, and the other aggregates the
results of different regularisation parameters. We confirm the validity of the
proposed methods by systematic comparisons with baseline methods, and
demonstrate their usefulness in exploratory analysis of high-dimensional
traffic simulation data. Preliminary theoretical analyses are also provided,
including a rigorous definition of variable selection for two-sample testing.
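To make the core ingredient concrete, here is a minimal sketch of the squared-MMD statistic with a Gaussian kernel carrying per-variable ARD weights, as described in the abstract. This is an illustration, not the authors' implementation: the function names (`ard_gaussian_kernel`, `mmd2_biased`) and the exact kernel parameterisation are assumptions for the example; the paper additionally optimises the weights under sparse regularisation, which is omitted here.

```python
import numpy as np

def ard_gaussian_kernel(X, Y, weights):
    """Gaussian kernel with per-dimension ARD weights.

    k(x, y) = exp(-sum_d (w_d * (x_d - y_d))**2), so a larger w_d makes
    dimension d contribute more to the measured discrepancy, while w_d = 0
    removes the variable entirely.
    """
    Xw = X * weights  # scale each dimension by its ARD weight
    Yw = Y * weights
    sq_dists = (
        np.sum(Xw**2, axis=1)[:, None]
        + np.sum(Yw**2, axis=1)[None, :]
        - 2.0 * Xw @ Yw.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0))  # clamp tiny negatives from rounding

def mmd2_biased(X, Y, weights):
    """Biased (V-statistic) estimate of the squared MMD between samples X and Y."""
    Kxx = ard_gaussian_kernel(X, X, weights)
    Kyy = ard_gaussian_kernel(Y, Y, weights)
    Kxy = ard_gaussian_kernel(X, Y, weights)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```

With two samples that differ only in one dimension, putting ARD weight on that dimension yields a much larger estimate than weighting the unchanged dimensions, which is the signal the power-maximisation step exploits when selecting variables.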
Related papers
- Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers [49.1574468325115]
We introduce a unified convergence analysis framework for deterministic samplers.
Our framework achieves an iteration complexity of $\tilde{O}(d^2/\epsilon)$.
We also provide a detailed analysis of Denoising Diffusion Implicit Models (DDIM)-type samplers.
arXiv Detail & Related papers (2024-10-18T07:37:36Z)
- Generative vs. Discriminative modeling under the lens of uncertainty quantification [0.929965561686354]
In this paper, we undertake a comparative analysis of generative and discriminative approaches.
We compare the ability of both approaches to leverage information from various sources in an uncertainty aware inference.
We propose a general sampling scheme enabling supervised learning for both approaches, as well as semi-supervised learning when compatible with the considered modeling approach.
arXiv Detail & Related papers (2024-06-13T14:32:43Z)
- Winning Prize Comes from Losing Tickets: Improve Invariant Learning by Exploring Variant Parameters for Out-of-Distribution Generalization [76.27711056914168]
Out-of-Distribution (OOD) Generalization aims to learn robust models that generalize well to various environments without fitting to distribution-specific features.
Recent studies based on the Lottery Ticket Hypothesis (LTH) address this problem by minimizing the learning target to find the subset of parameters critical to the task.
We propose Exploring Variant parameters for Invariant Learning (EVIL), which also leverages distribution knowledge to find the parameters that are sensitive to distribution shift.
arXiv Detail & Related papers (2023-10-25T06:10:57Z)
- Variable Selection for Kernel Two-Sample Tests [10.768155884359777]
We propose a framework based on the kernel maximum mean discrepancy (MMD).
We present mixed-integer programming formulations and develop exact and approximation algorithms with performance guarantees.
Experiment results on synthetic and real datasets demonstrate the superior performance of our approach.
arXiv Detail & Related papers (2023-02-15T00:39:56Z)
- Spectral Regularized Kernel Two-Sample Tests [7.915420897195129]
We show that the popular MMD (maximum mean discrepancy) two-sample test is not optimal in terms of the separation boundary measured in Hellinger distance.
We propose a modification to the MMD test based on spectral regularization and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test.
Our results hold for the permutation variant of the test, where the test threshold is chosen through permutation of the samples.
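The permutation variant mentioned above can be sketched generically: pool the two samples, re-split the pool at random many times, and take a high quantile of the resulting null statistics as the rejection threshold. This is a generic illustration of permutation thresholding, not the spectral-regularized test from that paper; the function name `permutation_threshold` and the simple mean-difference statistic used below are assumptions for the example.

```python
import numpy as np

def permutation_threshold(X, Y, statistic, n_perm=200, level=0.05, seed=0):
    """Estimate a two-sample test's rejection threshold by permutation.

    Pools the two samples, repeatedly re-splits the pool at random into
    groups of the original sizes, and returns the (1 - level) quantile of
    the statistics computed on those null re-splits.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    n = len(X)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))  # random re-split under the null
        null_stats.append(statistic(pooled[idx[:n]], pooled[idx[n:]]))
    return np.quantile(null_stats, 1.0 - level)
```

In use, any scalar statistic works, e.g. `statistic = lambda A, B: abs(A.mean() - B.mean())`; the test rejects when the statistic on the original split exceeds the returned threshold, which controls the false-rejection rate at roughly `level` without distributional assumptions.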
arXiv Detail & Related papers (2022-12-19T00:42:21Z)
- Two-Stage Robust and Sparse Distributed Statistical Inference for Large-Scale Data [18.34490939288318]
We address the problem of conducting statistical inference in settings involving large-scale data that may be high-dimensional and contaminated by outliers.
We propose a two-stage distributed and robust statistical inference procedure that copes with high-dimensional models by promoting sparsity.
arXiv Detail & Related papers (2022-08-17T11:17:47Z)
- A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Greedy Search Algorithms for Unsupervised Variable Selection: A Comparative Study [3.4888132404740797]
This paper focuses on dimensionality reduction based on unsupervised variable selection.
We present a critical evaluation of seven unsupervised greedy variable selection algorithms.
We introduce and evaluate, for the first time, a lazy implementation of the variance-explained-based forward selection component analysis (FSCA) algorithm.
arXiv Detail & Related papers (2021-03-03T21:10:26Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There are rising concerns about whether the learned scoring function causes systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.