Related papers: Binary Hypothesis Testing for Softmax Models and Leverage Score Models

URL: http://arxiv.org/abs/2405.06003v1
Date: Thu, 9 May 2024 15:56:29 GMT
Title: Binary Hypothesis Testing for Softmax Models and Leverage Score Models
Authors: Yeqi Gao, Yuzhou Gu, Zhao Song
Abstract summary: We consider the problem of binary hypothesis testing in the setting of softmax models. We draw an analogy between the softmax model and the leverage score model.
Score: 8.06972158448711
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Softmax distributions are widely used in machine learning, including Large Language Models (LLMs) where the attention unit uses softmax distributions. We abstract the attention unit as the softmax model, where given a vector input, the model produces an output drawn from the softmax distribution (which depends on the vector input). We consider the fundamental problem of binary hypothesis testing in the setting of softmax models. That is, given an unknown softmax model, which is known to be one of two given softmax models, how many queries are needed to determine which one is the truth? We show that the sample complexity is asymptotically $O(\epsilon^{-2})$, where $\epsilon$ is a certain distance between the parameters of the models. Furthermore, we draw an analogy between the softmax model and the leverage score model, an important tool for algorithm design in linear algebra and graph theory. The leverage score model, at a high level, is a model which, given a vector input, produces an output drawn from a distribution dependent on the input. We obtain similar results for the binary hypothesis testing problem for leverage score models.
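
Illustration: the testing problem described in the abstract can be sketched with a plain log-likelihood-ratio test. This is a minimal sketch under stated assumptions, not the paper's algorithm. It assumes the softmax model with parameter matrix A answers a query x with an index drawn from softmax(Ax); the names softmax_model, make_oracle, decide, and the constant C are hypothetical, and eps here is simply the size of a perturbation to one matrix entry, not the specific distance the paper defines.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_model(A, x):
    """Output distribution softmax(Ax) (assumed form of the model)."""
    return softmax([sum(a * xi for a, xi in zip(row, x)) for row in A])


def make_oracle(A, rng):
    """Hypothetical query oracle: returns one index drawn from softmax(Ax)."""
    def query(x):
        p = softmax_model(A, x)
        return rng.choices(range(len(p)), weights=p, k=1)[0]
    return query


def decide(A0, A1, x, query, num_queries):
    """Return 0 or 1, whichever candidate has the higher likelihood."""
    p0 = softmax_model(A0, x)
    p1 = softmax_model(A1, x)
    llr = 0.0  # running sum of log(p1[i] / p0[i]) over observed outputs
    for _ in range(num_queries):
        i = query(x)
        llr += math.log(p1[i]) - math.log(p0[i])
    return 1 if llr > 0 else 0


# Two candidate models that differ by eps in a single entry.
eps = 0.5
A0 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
A1 = [[1.0 + eps, 0.0], [0.0, 1.0], [0.0, 0.0]]
x = [1.0, 1.0]

# Query budget scaling like eps^-2; C is an illustrative constant.
C = 500
num_queries = math.ceil(C / eps ** 2)

rng = random.Random(0)
print(decide(A0, A1, x, make_oracle(A1, rng), num_queries))
```

Halving eps in this sketch quadruples the query budget, which mirrors the $O(\epsilon^{-2})$ scaling stated in the abstract; the choice of query vector x is fixed here for simplicity, whereas choosing informative queries is part of the actual problem.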

Related papers

Learning-Order Autoregressive Models with Application to Molecular Graph Generation [52.44913282062524]
We introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering that is sequentially inferred from data. We demonstrate experimentally that our method can learn meaningful autoregressive orderings in image and graph generation.
arXiv Detail & Related papers (2025-03-07T23:24:24Z)
Adaptive Sampled Softmax with Inverted Multi-Index: Methods, Theory and Applications [79.53938312089308]
The MIDX-Sampler is a novel adaptive sampling strategy based on an inverted multi-index approach. Our method is backed by rigorous theoretical analysis, addressing key concerns such as sampling bias, gradient bias, convergence rates, and generalization error bounds.
arXiv Detail & Related papers (2025-01-15T04:09:21Z)
Model Stealing for Any Low-Rank Language Model [25.16701867917684]
We build a theoretical understanding of stealing language models by studying a simple and mathematically tractable setting. Our main result is an efficient algorithm in the conditional query model, for learning any low-rank distribution. This is an interesting example where, at least theoretically, allowing a machine learning model to solve more complex problems at inference time can lead to drastic improvements in its performance.
arXiv Detail & Related papers (2024-11-12T04:25:31Z)
Beyond Closure Models: Learning Chaotic-Systems via Physics-Informed Neural Operators [78.64101336150419]
Predicting the long-term behavior of chaotic systems is crucial for various applications such as climate modeling. An alternative approach to a fully resolved simulation is to use a coarse grid and then correct its errors through a closure model. We propose an alternative end-to-end learning approach using a physics-informed neural operator (PINO) that overcomes this limitation.
arXiv Detail & Related papers (2024-08-09T17:05:45Z)
Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs). We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model. We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
BayesBlend: Easy Model Blending using Pseudo-Bayesian Model Averaging, Stacking and Hierarchical Stacking in Python [0.0]
We introduce the BayesBlend Python package to estimate weights and blend multiple (Bayesian) models' predictive distributions. BayesBlend implements pseudo-Bayesian model averaging, stacking and, uniquely, hierarchical Bayesian stacking to estimate model weights. We demonstrate the usage of BayesBlend with examples of insurance loss modeling.
arXiv Detail & Related papers (2024-04-30T19:15:33Z)
A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning. We show how to maximize the likelihood of a symbolic constraint with respect to the neural network's output distribution. We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
Attention Scheme Inspired Softmax Regression [20.825033982038455]
Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. In this work, inspired by the softmax unit, we define a softmax regression problem.
arXiv Detail & Related papers (2023-04-20T15:50:35Z)
r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax that outputs a sparse probability distribution with a controllable sparsity rate. We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements. We derive an information-theoretic lower bound for the minimax risk under this setting. We show that our method and upper bounds can be extended to two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it. Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
Argmax Flows and Multinomial Diffusion: Towards Non-Autoregressive Language Models [76.22217735434661]
This paper introduces two new classes of generative models for categorical data: Argmax Flows and Multinomial Diffusion. We demonstrate that our models perform competitively on language modelling and modelling of image segmentation maps.
arXiv Detail & Related papers (2021-02-10T11:04:17Z)
Estimating Stochastic Linear Combination of Non-linear Regressions Efficiently and Scalably [23.372021234032363]
We show that when the sub-sample sizes are large then the estimation errors will be sacrificed by too much. To the best of our knowledge, this is the first work that provides guarantees for the stochastic linear combination of non-linear regressions model.
arXiv Detail & Related papers (2020-10-19T07:15:38Z)
On the Discrepancy between Density Estimation and Sequence Generation [92.70116082182076]
Log-likelihood is highly correlated with BLEU when we consider models within the same family. We observe no correlation between rankings of models across different families.
arXiv Detail & Related papers (2020-02-17T20:13:35Z)
