Sampling-Based Estimation of Jaccard Containment and Similarity
- URL: http://arxiv.org/abs/2507.10019v3
- Date: Sun, 20 Jul 2025 11:14:22 GMT
- Title: Sampling-Based Estimation of Jaccard Containment and Similarity
- Authors: Pranav Joshi
- Abstract summary: The study introduces a binomial model for predicting the overlap between samples, demonstrating that it is both accurate and practical when sample sizes are small compared to the original sets. The framework is extended to estimate set similarity, and the paper provides guidance for applying these methods in large scale data systems where only partial or sampled data is available.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of estimating the containment and similarity between two sets using only random samples from each set, without relying on sketches of full sets. The study introduces a binomial model for predicting the overlap between samples, demonstrating that it is both accurate and practical when sample sizes are small compared to the original sets. The paper compares this model to previous approaches and shows that it provides better estimates under the considered conditions. It also analyzes the statistical properties of the estimator, including error bounds and sample size requirements needed to achieve a desired level of accuracy and confidence. The framework is extended to estimate set similarity, and the paper provides guidance for applying these methods in large scale data systems where only partial or sampled data is available.
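The abstract does not spell out the estimator, but the binomial model it describes suggests a simple moment-matching sketch: under uniform sampling, an element of the intersection lands in both samples with probability equal to the product of the two sampling rates, so the observed sample overlap can be scaled up to estimate the intersection size, containment, and Jaccard similarity. The Python sketch below illustrates this idea under those assumptions only; the paper's exact estimator, error bounds, and sample-size requirements should be taken from the paper itself.

```python
import random

def estimate_overlap(sample_a, sample_b, size_a, size_b):
    """Estimate |A ∩ B| from independent uniform samples of A and B.

    Under a binomial model, an element of A ∩ B appears in both samples
    with probability (len(sample_a)/size_a) * (len(sample_b)/size_b),
    so the observed sample overlap is scaled up by that probability.
    """
    overlap = len(set(sample_a) & set(sample_b))
    p_both = (len(sample_a) / size_a) * (len(sample_b) / size_b)
    return overlap / p_both if p_both > 0 else 0.0

def estimate_containment_and_jaccard(sample_a, sample_b, size_a, size_b):
    inter = estimate_overlap(sample_a, sample_b, size_a, size_b)
    inter = min(inter, size_a, size_b)           # clip to a feasible value
    containment = inter / size_a                 # C(A, B) = |A ∩ B| / |A|
    jaccard = inter / (size_a + size_b - inter)  # J(A, B) = |A ∩ B| / |A ∪ B|
    return containment, jaccard

# Example: two overlapping integer sets, sampled without replacement.
# True containment is 0.4 and true Jaccard similarity is 0.25.
A = set(range(0, 100_000))
B = set(range(60_000, 160_000))
sample_a = random.sample(sorted(A), 2_000)
sample_b = random.sample(sorted(B), 2_000)
print(estimate_containment_and_jaccard(sample_a, sample_b, len(A), len(B)))
```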
Related papers
- Size-adaptive Hypothesis Testing for Fairness [8.315080617799445]
We introduce a unified, size-adaptive, hypothesis-testing framework that turns fairness assessment into an evidence-based statistical decision. We prove a Central-Limit result for the statistical parity difference, leading to analytic confidence intervals and a Wald test whose type-I (false positive) error is guaranteed at level $\alpha$. For the long tail of small intersectional groups, we derive a fully Bayesian Dirichlet-multinomial estimator.
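As a rough illustration of the Wald-test component described above (not the paper's exact construction), one can estimate the statistical parity difference between two groups, attach a CLT-based standard error, and report a confidence interval and a two-sided p-value. The group counts and $\alpha$ level below are illustrative.

```python
from math import sqrt
from scipy.stats import norm

def wald_parity_test(pos_a, n_a, pos_b, n_b, alpha=0.05):
    """Wald test for the statistical parity difference between two groups.

    pos_*: positive predictions per group, n_*: group sizes. Returns the
    estimated difference, a (1 - alpha) confidence interval, and a
    two-sided p-value based on the normal approximation (CLT).
    """
    p_a, p_b = pos_a / n_a, pos_b / n_b
    spd = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = spd / se
    p_value = 2 * norm.sf(abs(z))
    z_crit = norm.ppf(1 - alpha / 2)
    return spd, (spd - z_crit * se, spd + z_crit * se), p_value

print(wald_parity_test(pos_a=420, n_a=1_000, pos_b=350, n_b=1_000))
```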
arXiv Detail & Related papers (2025-06-12T11:22:09Z) - Rethinking the generalization of drug target affinity prediction algorithms via similarity aware evaluation [19.145735532822012]
We show that the canonical randomized split of a test set in conventional evaluation leaves the test set dominated by samples with high similarity to the training set. We propose a framework of similarity aware evaluation in which a novel split methodology is proposed to adapt to any desired distribution. Results demonstrate that the proposed split methodology can significantly better fit desired distributions and guide the development of models.
arXiv Detail & Related papers (2025-04-13T08:30:57Z) - Assessing Model Generalization in Vicinity [34.86022681163714]
This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels.
We propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample.
The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy.
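One plausible (hypothetical) instantiation of this idea, assuming access only to a feature representation and predicted labels for an unlabeled test set, is to score each sample by how often its prediction agrees with those of its nearest neighbours and average the scores; the paper's actual vicinity-based score may differ.

```python
import numpy as np

def vicinity_accuracy_estimate(features, pred_labels, k=10):
    """Unsupervised accuracy proxy: score each test sample by how often its
    predicted label agrees with those of its k nearest neighbours in feature
    space, then average the scores over the whole test set."""
    # Pairwise distances (fine for small test sets; use a KD-tree for larger ones).
    d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude the sample itself
    neighbours = np.argsort(d, axis=1)[:, :k]
    agree = (pred_labels[neighbours] == pred_labels[:, None]).mean(axis=1)
    return agree.mean()

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))
pred_labels = rng.integers(0, 5, size=200)
print(vicinity_accuracy_estimate(features, pred_labels))
```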
arXiv Detail & Related papers (2024-06-13T15:58:37Z) - PQMass: Probabilistic Assessment of the Quality of Generative Models using Probability Mass Estimation [7.143427689586699]
We propose a likelihood-free method for comparing two distributions given samples from each. PQMass divides the sample space into non-overlapping regions and applies chi-squared tests to the number of data samples that fall within each region. We show that PQMass scales well to moderately high-dimensional data and thus obviates the need for feature extraction in practical applications.
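A simplified, hedged sketch of the region-counting idea (not PQMass's exact procedure): define regions from randomly chosen reference points, count how many samples from each set fall into each region, and run a chi-squared test on the resulting contingency table. Region construction and null calibration in the actual method may differ.

```python
import numpy as np
from scipy.stats import chi2_contingency

def region_counts(x, centers):
    """Assign each sample to its nearest reference point (Voronoi region)."""
    d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
    return np.bincount(d.argmin(axis=1), minlength=len(centers))

def region_chi2_test(x, y, n_regions=20, seed=0):
    """Compare two sample sets by binning them into shared regions and
    applying a chi-squared test to the per-region counts."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    centers = pooled[rng.choice(len(pooled), size=n_regions, replace=False)]
    table = np.vstack([region_counts(x, centers), region_counts(y, centers)])
    stat, p_value, _, _ = chi2_contingency(table)
    return stat, p_value

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(1_000, 5))
y = rng.normal(0.2, 1.0, size=(1_000, 5))
print(region_chi2_test(x, y))
```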
arXiv Detail & Related papers (2024-02-06T19:39:26Z) - Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z) - Statistical Model Criticism of Variational Auto-Encoders [15.005894753472894]
We propose a framework for the statistical evaluation of variational auto-encoders (VAEs).
We test two instances of this framework in the context of modelling images of handwritten digits and a corpus of English text.
arXiv Detail & Related papers (2022-04-06T18:19:29Z) - BRIO: Bringing Order to Abstractive Summarization [107.97378285293507]
We propose a novel training paradigm which assumes a non-deterministic distribution.
Our method achieves a new state-of-the-art result on the CNN/DailyMail (47.78 ROUGE-1) and XSum (49.07 ROUGE-1) datasets.
arXiv Detail & Related papers (2022-03-31T05:19:38Z) - Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
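As a minimal example of the oversampling strategy mentioned above (random duplication of minority-class instances; undersampling and synthetic methods such as SMOTE are common alternatives), the following sketch balances a toy binary dataset:

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Balance a binary dataset by duplicating minority-class examples
    until both classes have the same number of instances."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    extra = [rng.choice(minority_idx) for _ in range(deficit)]
    return X + [X[i] for i in extra], y + [y[i] for i in extra]

X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # both classes now have 4 examples
```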
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator [60.799183326613395]
We propose an unbiased estimator for categorical random variables based on multiple mutually negatively correlated (jointly antithetic) samples.
CARMS combines REINFORCE with copula based sampling to avoid duplicate samples and reduce its variance, while keeping the estimator unbiased using importance sampling.
We evaluate CARMS on several benchmark datasets on a generative modeling task, as well as a structured output prediction task, and find it to outperform competing methods including a strong self-control baseline.
arXiv Detail & Related papers (2021-10-26T20:14:30Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - A Case Study on Sampling Strategies for Evaluating Neural Sequential Item Recommendation Models [69.32128532935403]
Two well-known strategies to sample negative items are uniform random sampling and sampling by popularity.
We re-evaluate current state-of-the-art sequential recommender models from this point of view.
We find that both sampling strategies can produce inconsistent rankings compared with the full ranking of the models.
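A hedged sketch of the two negative-sampling strategies named above (uniform vs. popularity-proportional), using a toy item catalogue; the paper's evaluation protocol is of course richer than this:

```python
import numpy as np

def sample_negatives(all_items, interacted, item_counts, k=100,
                     strategy="uniform", seed=0):
    """Draw k negative items for one user, either uniformly at random
    or proportionally to item popularity (interaction counts)."""
    rng = np.random.default_rng(seed)
    candidates = np.array([i for i in all_items if i not in interacted])
    if strategy == "popularity":
        weights = np.array([item_counts[i] for i in candidates], dtype=float)
        probs = weights / weights.sum()
    else:
        probs = None  # uniform sampling
    return rng.choice(candidates, size=k, replace=False, p=probs)

# Toy catalogue: item id -> number of interactions (popularity).
item_counts = dict(enumerate([50, 40, 30, 5, 3, 2, 1, 1, 1, 1]))
all_items = list(item_counts)
interacted = {0, 3}
print(sample_negatives(all_items, interacted, item_counts, k=3,
                       strategy="popularity"))
```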
arXiv Detail & Related papers (2021-07-27T19:06:03Z) - Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data.
Our test requires essentially no assumptions on the distributions.
By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z) - Learning Ising models from one or multiple samples [26.00403702328348]
We provide guarantees for one-sample estimation, quantifying the estimation error in terms of the metric entropy of a family of interaction matrices.
Our technical approach benefits from sparsifying a model's interaction network, conditioning on subsets of variables that make the dependencies in the resulting conditional distribution sufficiently weak.
arXiv Detail & Related papers (2020-04-20T15:17:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.