Testing High-dimensional Multinomials with Applications to Text Analysis
- URL: http://arxiv.org/abs/2301.01381v2
- Date: Fri, 24 Nov 2023 22:29:18 GMT
- Title: Testing High-dimensional Multinomials with Applications to Text Analysis
- Authors: T. Tony Cai, Zheng Tracy Ke, Paxton Turner
- Abstract summary: A test statistic is shown to have an asymptotic standard normal distribution under the null.
The proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest.
- Score: 9.952321247299336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Motivated by applications in text mining and discrete distribution inference,
we investigate the testing for equality of probability mass functions of $K$
groups of high-dimensional multinomial distributions. A test statistic, which
is shown to have an asymptotic standard normal distribution under the null, is
proposed. The optimal detection boundary is established, and the proposed test
is shown to achieve this optimal detection boundary across the entire parameter
space of interest. The proposed method is demonstrated in simulation studies
and applied to analyze two real-world datasets to examine variation among
consumer reviews of Amazon movies and diversity of statistical paper abstracts.
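For orientation, the sketch below tests the same null hypothesis (all $K$ groups share one probability mass function) using a classical Pearson chi-square homogeneity statistic, normalized as $(T - \mathrm{df})/\sqrt{2\,\mathrm{df}}$ so that it is approximately standard normal when the number of categories is large. This is a generic baseline for illustration only, not the paper's proposed statistic; the paper's contribution is a test that attains the optimal detection boundary in the high-dimensional regime, where such classical approximations can be unreliable.

```python
# Illustrative sketch only: a classical chi-square homogeneity test across K
# multinomial samples, normalized so the statistic is approximately N(0, 1)
# when the degrees of freedom are large. This is NOT the paper's proposed
# test statistic; it is a generic baseline for the same null hypothesis.
import numpy as np
from scipy.stats import norm

def normalized_chisq_homogeneity(counts):
    """counts: (K, p) array; row k holds the multinomial counts of group k."""
    counts = np.asarray(counts, dtype=float)
    K, p = counts.shape
    row = counts.sum(axis=1, keepdims=True)   # group totals n_k
    col = counts.sum(axis=0, keepdims=True)   # pooled category totals
    total = counts.sum()
    expected = row @ col / total              # expected counts under the null
    mask = expected > 0                       # skip empty categories
    chisq = ((counts - expected)[mask] ** 2 / expected[mask]).sum()
    df = (K - 1) * (p - 1)
    z = (chisq - df) / np.sqrt(2 * df)        # approx N(0, 1) for large df
    return z, norm.sf(z)                      # statistic and one-sided p-value

# Toy check under the null: K = 3 groups drawn from one shared pmf.
rng = np.random.default_rng(0)
pmf = rng.dirichlet(np.ones(500))             # shared pmf over 500 categories
X = rng.multinomial(2000, pmf, size=3)
print(normalized_chisq_homogeneity(X))
```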
Related papers
- Permutation-Free High-Order Interaction Tests [0.7373617024876725]
We introduce a family of permutation-free high-order tests for joint independence and partial factorisations of $d$ variables. Our tests eliminate the need for permutation-based approximations by leveraging V-statistics and a novel cross-centring technique.
arXiv Detail & Related papers (2025-06-06T10:42:10Z) - Pre-validation Revisited [79.92204034170092]
We show properties and benefits of pre-validation in prediction, inference and error estimation by simulations and applications. We propose not only an analytical distribution of the test statistic for the pre-validated predictor under certain models, but also a generic bootstrap procedure to conduct inference.
arXiv Detail & Related papers (2025-05-21T00:20:14Z) - Minimax Optimal Kernel Two-Sample Tests with Random Features [8.030917052755195]
We propose a spectral regularized two-sample test based on random Fourier feature (RFF) approximation.
We show the proposed test to be minimax optimal if the approximation order of RFF is sufficiently large.
We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter and the kernel.
arXiv Detail & Related papers (2025-02-28T06:12:00Z) - An Efficient Permutation-Based Kernel Two-Sample Test [13.229867216847534]
Two-sample hypothesis testing is a fundamental problem in statistics and machine learning.
In this work, we use a Nyström approximation of the maximum mean discrepancy (MMD) to design a computationally efficient and practical testing algorithm. (A plain MMD permutation-test sketch, without any such acceleration, appears after this list.)
arXiv Detail & Related papers (2025-02-19T09:22:48Z) - Winners with Confidence: Discrete Argmin Inference with an Application to Model Selection [11.62889979871371]
We study the problem of finding the index of the minimum value of a vector from noisy observations.
This problem is relevant in population/policy comparison, discrete maximum likelihood, and model selection.
We develop an asymptotically normal test statistic, even in high-dimensional settings.
arXiv Detail & Related papers (2024-08-04T15:20:23Z) - Combine and Conquer: A Meta-Analysis on Data Shift and Out-of-Distribution Detection [30.377446496559635]
This paper introduces a universal approach to seamlessly combine out-of-distribution (OOD) detection scores.
Our framework is easily extensible to future developments in detection scores and stands as the first to combine decision boundaries in this context.
arXiv Detail & Related papers (2024-06-23T08:16:44Z) - Embedding Trajectory for Out-of-Distribution Detection in Mathematical Reasoning [50.84938730450622]
We propose a trajectory-based method TV score, which uses trajectory volatility for OOD detection in mathematical reasoning.
Our method outperforms all traditional algorithms on GLMs under mathematical reasoning scenarios.
Our method can be extended to more applications with high-density features in output spaces, such as multiple-choice questions.
arXiv Detail & Related papers (2024-05-22T22:22:25Z) - Collaborative non-parametric two-sample testing [55.98760097296213]
Given samples from a pair of distributions $p_v$ and $q_v$ at each node $v$ of a graph, the goal is to identify the nodes where the null hypothesis $p_v = q_v$ should be rejected.
We propose the non-parametric collaborative two-sample testing (CTST) framework that efficiently leverages the graph structure.
Our methodology integrates elements from f-divergence estimation, Kernel Methods, and Multitask Learning.
arXiv Detail & Related papers (2024-02-08T14:43:56Z) - Distributed Markov Chain Monte Carlo Sampling based on the Alternating Direction Method of Multipliers [143.6249073384419]
In this paper, we propose a distributed sampling scheme based on the alternating direction method of multipliers.
We provide both theoretical guarantees of our algorithm's convergence and experimental evidence of its superiority to the state-of-the-art.
In simulation, we deploy our algorithm on linear and logistic regression tasks and illustrate its fast convergence compared to existing gradient-based methods.
arXiv Detail & Related papers (2024-01-29T02:08:40Z) - Boosting the Power of Kernel Two-Sample Tests [4.07125466598411]
A kernel two-sample test based on the maximum mean discrepancy (MMD) is one of the most popular methods for detecting differences between two distributions over general metric spaces.
We propose a method to boost the power of the kernel test by combining MMD estimates over multiple kernels using their Mahalanobis distance.
arXiv Detail & Related papers (2023-02-21T14:14:30Z) - Spectral Regularized Kernel Two-Sample Tests [7.915420897195129]
We show the popular MMD (maximum mean discrepancy) two-sample test to be not optimal in terms of the separation boundary measured in Hellinger distance.
We propose a modification to the MMD test based on spectral regularization and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test.
Our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples.
arXiv Detail & Related papers (2022-12-19T00:42:21Z) - Differential privacy and robust statistics in high dimensions [49.50869296871643]
High-dimensional Propose-Test-Release (HPTR) builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism.
We show that HPTR nearly achieves the optimal sample complexity under several scenarios studied in the literature.
arXiv Detail & Related papers (2021-11-12T06:36:40Z) - Hypothesis Testing for Equality of Latent Positions in Random Graphs [0.2741266294612775]
We consider the hypothesis testing problem that two vertices $i$ and $j$ have the same latent positions, possibly up to scaling.
We propose several test statistics based on the empirical Mahalanobis distances between the $i$th and $j$th rows of either the adjacency or the normalized Laplacian spectral embedding of the graph.
Using these test statistics, we address the model selection problem of choosing between the standard block model and its degree-corrected variant.
arXiv Detail & Related papers (2021-05-23T01:27:23Z) - The UU-test for Statistical Modeling of Unimodal Data [0.20305676256390928]
We propose a technique called UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset.
A unique feature of this approach is that in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model.
arXiv Detail & Related papers (2020-08-28T08:34:28Z) - Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
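The kernel two-sample entries above share a common template: estimate the MMD between the two samples and calibrate the rejection threshold by permutation. As a point of reference, here is a minimal quadratic-time sketch of that template with a Gaussian kernel and a hand-fixed bandwidth; it implements none of the refinements studied in those papers (random Fourier features, Nyström approximation, spectral regularization, multi-kernel Mahalanobis combination).

```python
# Illustrative sketch only: a plain quadratic-time unbiased MMD^2 two-sample
# test with a permutation threshold and a Gaussian kernel. The bandwidth is
# fixed by hand; the related papers above study faster approximations and
# data-adaptive kernel/regularization choices, none of which appear here.
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2(X, Y, bandwidth):
    Kxx = gaussian_kernel(X, X, bandwidth)
    Kyy = gaussian_kernel(Y, Y, bandwidth)
    Kxy = gaussian_kernel(X, Y, bandwidth)
    n, m = len(X), len(Y)
    # Unbiased estimate: drop the diagonal of the within-sample blocks.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

def permutation_test(X, Y, bandwidth=1.0, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, bandwidth)
    Z = np.vstack([X, Y])
    n = len(X)
    null_stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))
        null_stats.append(mmd2(Z[idx[:n]], Z[idx[n:]], bandwidth))
    # "+1" correction keeps the permutation p-value valid at finite n_perm.
    pval = (1 + sum(s >= observed for s in null_stats)) / (n_perm + 1)
    return observed, pval

# Toy check under a mean-shift alternative.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 5))
Y = rng.normal(0.5, 1.0, size=(100, 5))
print(permutation_test(X, Y))
```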