Related papers: A Method of Moments Embedding Constraint and its Application to Semi-Supervised Learning

A Method of Moments Embedding Constraint and its Application to Semi-Supervised Learning

URL: http://arxiv.org/abs/2404.17978v1
Date: Sat, 27 Apr 2024 18:41:32 GMT
Title: A Method of Moments Embedding Constraint and its Application to Semi-Supervised Learning
Authors: Michael Majurski, Sumeet Menon, Parniyan Farvardin, David Chapman,
Abstract summary: Discnative deep learning models with a linear+softmax final layer have a problem. Latent space only predicts the conditional probabilities $p(Y|X)$ but not the full joint distribution $p(Y,X)$. This exacerbates model over-confidence impacting many problems, such as hallucinations, confounding biases, and dependence on large datasets.
Score: 2.8266810371534152
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Discriminative deep learning models with a linear+softmax final layer have a problem: the latent space only predicts the conditional probabilities $p(Y|X)$ but not the full joint distribution $p(Y,X)$, which necessitates a generative approach. The conditional probability cannot detect outliers, causing outlier sensitivity in softmax networks. This exacerbates model over-confidence impacting many problems, such as hallucinations, confounding biases, and dependence on large datasets. To address this we introduce a novel embedding constraint based on the Method of Moments (MoM). We investigate the use of polynomial moments ranging from 1st through 4th order hyper-covariance matrices. Furthermore, we use this embedding constraint to train an Axis-Aligned Gaussian Mixture Model (AAGMM) final layer, which learns not only the conditional, but also the joint distribution of the latent space. We apply this method to the domain of semi-supervised image classification by extending FlexMatch with our technique. We find our MoM constraint with the AAGMM layer is able to match the reported FlexMatch accuracy, while also modeling the joint distribution, thereby reducing outlier sensitivity. We also present a preliminary outlier detection strategy based on Mahalanobis distance and discuss future improvements to this strategy. Code is available at: \url{https://github.com/mmajurski/ssl-gmm}

Related papers

SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference [29.49615352723995]
Mixture-of-Experts (MoE) models activate only a small subset of relevant experts per input.<n>The sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device.<n>We propose a greedy decomposition method to decompose the original problem into a series of subproblems.
arXiv Detail & Related papers (2025-07-09T05:43:43Z)
MaskPro: Linear-Space Probabilistic Learning for Strict (N:M)-Sparsity on Large Language Models [53.36415620647177]
Semi-structured sparsity offers a promising solution by strategically retaining $N$ elements out of every $M$ weights.<n>Existing (N:M)-compatible approaches typically fall into two categories: rule-based layerwise greedy search, which suffers from considerable errors, and gradient-driven learning, which incurs prohibitive training costs.<n>We propose a novel linear-space probabilistic framework named MaskPro, which aims to learn a prior categorical distribution for every $M$ consecutive weights and subsequently leverages this distribution to generate the (N:M)-sparsity throughout an $N$-way sampling
arXiv Detail & Related papers (2025-06-15T15:02:59Z)
Sparse Linear Bandits with Blocking Constraints [22.01704171400845]
We investigate the high-dimensional sparse linear bandits problem in a data-poor regime.<n>We show novel offline statistical guarantees of the lasso estimator for the linear model.<n>We propose a meta-algorithm based on corralling that does not need knowledge of optimal sparsity parameter $k$ at minimal cost to regret.
arXiv Detail & Related papers (2024-10-26T01:42:03Z)
Agnostic Smoothed Online Learning [5.167069404528051]
We propose an algorithm to guarantee sublinear regret for smoothed online learning without prior knowledge of $mu$. R-Cover has adaptive regret $tilde O(sqrtdT/sigma)$ for function classes with dimension $d$, which is optimal up to logarithmic factors.
arXiv Detail & Related papers (2024-10-07T15:25:21Z)
Beyond Closure Models: Learning Chaotic-Systems via Physics-Informed Neural Operators [78.64101336150419]
Predicting the long-term behavior of chaotic systems is crucial for various applications such as climate modeling. An alternative approach to such a full-resolved simulation is using a coarse grid and then correcting its errors through a temporalittext model. We propose an alternative end-to-end learning approach using a physics-informed neural operator (PINO) that overcomes this limitation.
arXiv Detail & Related papers (2024-08-09T17:05:45Z)
Fast Semisupervised Unmixing Using Nonconvex Optimization [80.11512905623417]
We introduce a novel convex convex model for semi/library-based unmixing. We demonstrate the efficacy of Alternating Methods of sparse unsupervised unmixing.
arXiv Detail & Related papers (2024-01-23T10:07:41Z)
Learning with Stochastic Orders [25.795107089736295]
Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via integral probability metrics (IPMs) We introduce the Choquet-Toland distance between probability measures, that can be used as a drop-in replacement for IPMsational. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints.
arXiv Detail & Related papers (2022-05-27T00:08:03Z)
Breaking the Softmax Bottleneck for Sequential Recommender Systems with Dropout and Decoupling [0.0]
We show that there are more aspects to the Softmax bottleneck in SBRSs. We propose a simple yet effective method, Dropout and Decoupling (D&D), to alleviate these problems. Our method significantly improves the accuracy of a variety of Softmax-based SBRS algorithms.
arXiv Detail & Related papers (2021-10-11T16:52:23Z)
A Unified Joint Maximum Mean Discrepancy for Domain Adaptation [73.44809425486767]
This paper theoretically derives a unified form of JMMD that is easy to optimize. From the revealed unified JMMD, we illustrate that JMMD degrades the feature-label dependence that benefits to classification. We propose a novel MMD matrix to promote the dependence, and devise a novel label kernel that is robust to label distribution shift.
arXiv Detail & Related papers (2021-01-25T09:46:14Z)
Differentially Private (Gradient) Expectation Maximization Algorithm with Statistical Guarantees [25.996994681543903]
(Gradient) Expectation Maximization (EM) is a widely used algorithm for estimating the maximum likelihood of mixture models or incomplete data problems. Previous research on this problem has already lead to the discovery of some Differentially Private (DP) algorithms for (Gradient) EM. We propose in this paper the first DP version of (Gradient) EM algorithm with statistical guarantees.
arXiv Detail & Related papers (2020-10-22T03:41:19Z)
Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination [29.85468294601847]
We revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits. We show that our algorithms succeed where conventional methods fail.
arXiv Detail & Related papers (2020-10-08T17:59:05Z)
Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model. This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine. We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z)
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $tilde O(|S||B|(gamma)-3epsilon-2)$ for finding the Nash equilibrium (NE) value up to some $epsilon$ error. We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator) We first consider $gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $mathcalS$ and action space $mathcalA$. We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.