Regularization for Shuffled Data Problems via Exponential Family Priors
on the Permutation Group
- URL: http://arxiv.org/abs/2111.01767v1
- Date: Tue, 2 Nov 2021 17:43:28 GMT
- Title: Regularization for Shuffled Data Problems via Exponential Family Priors
on the Permutation Group
- Authors: Zhenbang Wang, Emanuel Ben-David, Martin Slawski
- Abstract summary: "Shuffled data" is data in which the correct pairing of (X, Y)-pairs is represented via an unknown index permutation.
We propose a flexible exponential family prior on the permutation group for this purpose.
Inference is based on the EM algorithm in which the intractable E-step is approximated by the Fisher-Yates algorithm.
- Score: 8.40077201352607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the analysis of data sets consisting of (X, Y)-pairs, a tacit assumption
is that each pair corresponds to the same observation unit. If, however, such
pairs are obtained via record linkage of two files, this assumption can be
violated as a result of mismatch error rooting, for example, in the lack of
reliable identifiers in the two files. Recently, there has been a surge of
interest in this setting under the term "Shuffled data" in which the underlying
correct pairing of (X, Y)-pairs is represented via an unknown index
permutation. Explicit modeling of the permutation tends to be associated with
substantial overfitting, prompting the need for suitable methods of
regularization. In this paper, we propose a flexible exponential family prior
on the permutation group for this purpose that can be used to integrate various
structures such as sparse and locally constrained shuffling. This prior turns
out to be conjugate for canonical shuffled data problems in which the
likelihood conditional on a fixed permutation can be expressed as product over
the corresponding (X,Y)-pairs. Inference is based on the EM algorithm in which
the intractable E-step is approximated by the Fisher-Yates algorithm. The
M-step is shown to admit a significant reduction from $n^2$ to $n$ terms if the
likelihood of (X,Y)-pairs has exponential family form as in the case of
generalized linear models. Comparisons on synthetic and real data show that the
proposed approach compares favorably to competing methods.
Related papers
- BI-EqNO: Generalized Approximate Bayesian Inference with an Equivariant Neural Operator Framework [9.408644291433752]
We introduce BI-EqNO, an equivariant neural operator framework for generalized approximate Bayesian inference.
BI-EqNO transforms priors into posteriors on conditioned observation data through data-driven training.
We demonstrate BI-EqNO's utility through two examples: (1) as a generalized Gaussian process (gGP) for regression, and (2) as an ensemble neural filter (EnNF) for sequential data assimilation.
arXiv Detail & Related papers (2024-10-21T18:39:16Z) - Conformalization of Sparse Generalized Linear Models [2.1485350418225244]
Conformal prediction method estimates a confidence set for $y_n+1$ that is valid for any finite sample size.
Although attractive, computing such a set is computationally infeasible in most regression problems.
We show how our path-following algorithm accurately approximates conformal prediction sets.
arXiv Detail & Related papers (2023-07-11T08:36:12Z) - Doubly Stochastic Matrix Models for Estimation of Distribution
Algorithms [2.28438857884398]
We explore the use of Doubly Matrices (DSM) for matching and assignment nature permutation problems.
Specifically, we adopt the framework of estimation of distribution algorithms and compare DSMs to some existing proposals for permutation problems.
Preliminary experiments on instances of the quadratic assignment problem validate this line of research and show that DSMs may obtain very competitive results.
arXiv Detail & Related papers (2023-04-05T14:36:48Z) - Score-based Continuous-time Discrete Diffusion Models [102.65769839899315]
We extend diffusion models to discrete variables by introducing a Markov jump process where the reverse process denoises via a continuous-time Markov chain.
We show that an unbiased estimator can be obtained via simple matching the conditional marginal distributions.
We demonstrate the effectiveness of the proposed method on a set of synthetic and real-world music and image benchmarks.
arXiv Detail & Related papers (2022-11-30T05:33:29Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z) - A Robust and Flexible EM Algorithm for Mixtures of Elliptical
Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z) - Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As results show, the proposed strategies perform better than the classification based on observed data and allow to keep a high accuracy even when the missing data ratio increases.
arXiv Detail & Related papers (2021-10-19T14:24:50Z) - Sparse Quadratic Optimisation over the Stiefel Manifold with Application
to Permutation Synchronisation [71.27989298860481]
We address the non- optimisation problem of finding a matrix on the Stiefel manifold that maximises a quadratic objective function.
We propose a simple yet effective sparsity-promoting algorithm for finding the dominant eigenspace matrix.
arXiv Detail & Related papers (2021-09-30T19:17:35Z) - Learning Bayesian Networks through Birkhoff Polytope: A Relaxation
Method [0.0]
We establish a novel framework for learning a directed acyclic graph (DAG) when data are generated from a Gaussian, linear structural equation model.
For permutation matrix estimation, we propose a relaxation technique that avoids the NP-hard problem of order estimation.
Our framework recovers DAGs without the need for an expensive verification of the acyclicity constraint or enumeration of possible parent sets.
arXiv Detail & Related papers (2021-07-04T15:04:02Z) - Variance-Reduced Splitting Schemes for Monotone Stochastic Generalized
Equations [0.0]
We consider monotone inclusion problems where the operators may be expectation-valued.
A direct application of splitting schemes is complicated by the need to resolve problems with expectation-valued maps at each step.
We propose an avenue for addressing uncertainty in the mapping: Variance-reduced modified forward-backward splitting scheme.
arXiv Detail & Related papers (2020-08-26T02:33:27Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.