Adjusted chi-square test for degree-corrected block models
- URL: http://arxiv.org/abs/2012.15047v1
- Date: Wed, 30 Dec 2020 05:20:59 GMT
- Title: Adjusted chi-square test for degree-corrected block models
- Authors: Linfan Zhang and Arash A. Amini
- Abstract summary: We propose a goodness-of-fit test for degree-corrected block models (DCSBM)
We show that a simple adjustment allows the statistic to converge in distribution, under null, as long as the harmonic mean of $d_i$ grows to infinity.
Our distributional results are nonasymptotic, with explicit constants, providing finite-sample bounds on the Kolmogorov-Smirnov distance to the target distribution.
- Score: 13.122543280692641
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a goodness-of-fit test for degree-corrected stochastic block
models (DCSBM). The test is based on an adjusted chi-square statistic for
measuring equality of means among groups of $n$ multinomial distributions with
$d_1,\dots,d_n$ observations. In the context of network models, the number of
multinomials, $n$, grows much faster than the number of observations, $d_i$,
hence the setting deviates from classical asymptotics. We show that a simple
adjustment allows the statistic to converge in distribution, under null, as
long as the harmonic mean of $\{d_i\}$ grows to infinity. This result applies
to large sparse networks where the role of $d_i$ is played by the degree of
node $i$. Our distributional results are nonasymptotic, with explicit
constants, providing finite-sample bounds on the Kolmogorov-Smirnov distance to
the target distribution. When applied sequentially, the test can also be used
to determine the number of communities. The test operates on a (row) compressed
version of the adjacency matrix, conditional on the degrees, and as a result is
highly scalable to large sparse networks. We incorporate a novel idea of
compressing the columns based on a $(K+1)$-community assignment when testing
for $K$ communities. This approach increases the power in sequential
applications without sacrificing computational efficiency, and we prove its
consistency in recovering the number of communities. Since the test statistic
does not rely on a specific alternative, its utility goes beyond sequential
testing and can be used to simultaneously test against a wide range of
alternatives outside the DCSBM family. We show the effectiveness of the
approach by extensive numerical experiments with simulated and real data. In
particular, applying the test to the Facebook-100 dataset, we find that a DCSBM
with a small number of communities is far from a good fit in almost all cases.
Related papers
- Doubly Robust Conditional Independence Testing with Generative Neural Networks [8.323172773256449]
This article addresses the problem of testing the conditional independence of two generic random vectors $X$ and $Y$ given a third random vector $Z$.
We propose a new non-parametric testing procedure that avoids explicitly estimating any conditional distributions.
arXiv Detail & Related papers (2024-07-25T01:28:59Z) - Collaborative non-parametric two-sample testing [55.98760097296213]
The goal is to identify nodes where the null hypothesis $p_v = q_v$ should be rejected.
We propose the non-parametric collaborative two-sample testing (CTST) framework that efficiently leverages the graph structure.
Our methodology integrates elements from f-divergence estimation, Kernel Methods, and Multitask Learning.
arXiv Detail & Related papers (2024-02-08T14:43:56Z) - A Permutation-free Kernel Two-Sample Test [36.50719125230106]
We propose a new quadratic-time MMD test statistic based on sample-splitting and studentization.
For large sample sizes, our new cross-MMD provides a significant speedup over the MMD, for only a slight loss in power.
arXiv Detail & Related papers (2022-11-27T18:15:52Z) - The Projected Covariance Measure for assumption-lean variable significance testing [3.8936058127056357]
A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero.
We study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$.
We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests.
arXiv Detail & Related papers (2022-11-03T17:55:50Z) - Efficient Aggregated Kernel Tests using Incomplete $U$-statistics [22.251118308736327]
Three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales.
We show that our proposed linear-time aggregated tests obtain higher power than current state-of-the-art linear-time kernel tests.
arXiv Detail & Related papers (2022-06-18T12:30:06Z) - Exact Paired-Permutation Testing for Structured Test Statistics [67.71280539312536]
We provide an efficient exact algorithm for the paired-permutation test for a family of structured test statistics.
Our exact algorithm was $10$x faster than the Monte Carlo approximation with $20000$ samples on a common dataset.
arXiv Detail & Related papers (2022-05-03T11:00:59Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-epsilon) X + epsilon B$, where $X$ is a zero-mean and unknown covariance Gaussian $mathcalN(0, Sigma)$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $Omega(d2)$ for $epsilon$ an arbitrarily small constant and $gamma
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Dimension-agnostic inference using cross U-statistics [33.17951971728784]
We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization.
The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks.
arXiv Detail & Related papers (2020-11-10T12:21:34Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $|theta infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.