Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions
- URL: http://arxiv.org/abs/2006.11909v2
- Date: Thu, 19 Nov 2020 02:42:32 GMT
- Title: Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions
- Authors: Charvi Rastogi, Sivaraman Balakrishnan, Nihar B. Shah, Aarti Singh
- Abstract summary: In this paper, we design two-sample tests for pairwise comparison data and ranking data.
Our test requires essentially no assumptions on the distributions.
By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
- Score: 57.77347280992548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A number of applications require two-sample testing on ranked preference
data. For instance, in crowdsourcing, there is a long-standing question of
whether pairwise comparison data provided by people is distributed similarly to
ratings-converted-to-comparisons. Other examples include sports data analysis
and peer grading. In this paper, we design two-sample tests for pairwise
comparison data and ranking data. For our two-sample test for pairwise
comparison data, we establish an upper bound on the sample complexity required
to correctly distinguish between the distributions of the two sets of samples.
Our test requires essentially no assumptions on the distributions. We then
prove complementary lower bounds showing that our results are tight (in the
minimax sense) up to constant factors. We investigate the role of modeling
assumptions by proving lower bounds for a range of pairwise comparison models
(WST, MST, SST, and parameter-based models such as BTL and Thurstone). We also provide
testing algorithms and associated sample complexity bounds for the problem of
two-sample testing with partial (or total) ranking data. Furthermore, we
empirically evaluate our results via extensive simulations as well as two
real-world datasets consisting of pairwise comparisons. By applying our
two-sample test on real-world pairwise comparison data, we conclude that
ratings and rankings provided by people are indeed distributed differently. On
the other hand, our test recognizes no significant difference in the relative
performance of European football teams across two seasons. Finally, we apply
our two-sample test on a real-world partial and total ranking dataset and find
a statistically significant difference in Sushi preferences across demographic
divisions based on gender, age and region of residence.
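The abstract describes the test only at a high level; as an illustration of the two-sample setup on pairwise comparison data, here is a minimal permutation-test sketch on empirical win-rate matrices. The Frobenius-norm statistic and all function names are assumptions for illustration, not the authors' procedure, which comes with the minimax sample-complexity guarantees described above.

```python
import numpy as np

def win_rate_matrix(comparisons, n_items):
    """Entry (i, j) = fraction of i-vs-j comparisons won by i
    (0.5 where the pair was never compared)."""
    wins = np.zeros((n_items, n_items))
    counts = np.zeros((n_items, n_items))
    for winner, loser in comparisons:
        wins[winner, loser] += 1
        counts[winner, loser] += 1
        counts[loser, winner] += 1
    return np.where(counts > 0, wins / np.maximum(counts, 1), 0.5)

def permutation_two_sample_test(sample1, sample2, n_items,
                                n_perm=1000, seed=0):
    """p-value for H0: both lists of (winner, loser) pairs are drawn
    from the same comparison distribution."""
    rng = np.random.default_rng(seed)
    def stat(a, b):
        return np.linalg.norm(win_rate_matrix(a, n_items)
                              - win_rate_matrix(b, n_items))
    pooled = list(sample1) + list(sample2)
    observed = stat(sample1, sample2)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        a = [pooled[i] for i in perm[:len(sample1)]]
        b = [pooled[i] for i in perm[len(sample1):]]
        exceed += stat(a, b) >= observed
    return (1 + exceed) / (1 + n_perm)
```

In the crowdsourcing use case above, `sample1` would hold direct pairwise comparisons and `sample2` ratings converted to comparisons.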
Related papers
- Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
We formalize detecting such distortions (changes between the model an API serves and its reference weights) as Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
arXiv Detail & Related papers (2024-10-26T18:34:53Z)
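The paper describes its test as built on a simple string kernel; the sketch below uses a character n-gram spectrum kernel as an assumed stand-in, combined with the standard unbiased MMD^2 statistic (the kernel choice, its lack of normalization, and all names are assumptions):

```python
from collections import Counter

def ngram_kernel(s, t, n=3):
    """Character n-gram spectrum kernel: inner product of n-gram counts."""
    cs = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    ct = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return float(sum(cs[g] * ct[g] for g in cs if g in ct))

def mmd2_unbiased(xs, ys, kernel=ngram_kernel):
    """Unbiased MMD^2 between two samples of strings (needs >= 2 each)."""
    m, n = len(xs), len(ys)
    kxx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    kyy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    kxy = sum(kernel(x, y) for x in xs for y in ys)
    return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2 * kxy / (m * n)
```

A permutation test over pooled completions from the API and from the reference weights would then convert this statistic into a p-value.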
- A framework for paired-sample hypothesis testing for high-dimensional data [7.400168551191579]
We put forward the idea that scoring functions can be produced by the decision rules defined by the bisecting hyperplanes of the line segments connecting each pair of instances.
We first estimate the bisecting hyperplane for each pair of instances, then derive an aggregated rule through the Hodges-Lehmann estimator.
arXiv Detail & Related papers (2023-09-28T09:17:11Z)
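The construction is only summarized above, so the following is a speculative sketch of one reading: each pair (x_i, y_i) induces the hyperplane bisecting the segment between them (normal y_i - x_i through the midpoint), a test point is scored by its signed distance to each such hyperplane, and the per-pair scores are combined with the Hodges-Lehmann estimator (the median of all Walsh averages). Names and details are illustrative.

```python
import numpy as np

def hodges_lehmann(scores):
    """Hodges-Lehmann estimator: median of all pairwise Walsh averages."""
    s = np.asarray(scores)
    walsh = (s[:, None] + s[None, :]) / 2.0
    return np.median(walsh[np.triu_indices(len(s))])

def aggregated_score(X, Y, z):
    """Score point z (shape (d,)) against the bisecting hyperplane of
    each paired sample (rows of X and Y), then aggregate."""
    normals = Y - X                          # normal of each hyperplane
    midpoints = (X + Y) / 2.0                # a point on each hyperplane
    dists = np.einsum('ij,ij->i', z - midpoints, normals)
    dists /= np.linalg.norm(normals, axis=1)
    return hodges_lehmann(dists)
```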
- Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score [62.54911162109439]
Adversarial detection aims to determine whether a given sample is an adversarial one based on the discrepancy between natural and adversarial distributions.
We propose a new statistic called expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations.
We develop EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples.
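A minimal sketch of the EPS computation as the summary describes it, assuming access to a pre-trained score network; `score_model(x_t, sigma)` is a hypothetical placeholder rather than a real API, and the Gaussian perturbation schedule is likewise assumed:

```python
import torch

def expected_perturbation_score(x, score_model, sigmas, n_draws=8):
    """EPS: the expected score of a sample over random perturbations.
    score_model is a placeholder for a pre-trained (e.g. diffusion)
    score network; sigmas is an assumed noise schedule."""
    scores = []
    for sigma in sigmas:
        for _ in range(n_draws):
            x_t = x + sigma * torch.randn_like(x)  # Gaussian perturbation
            scores.append(score_model(x_t, sigma))
    return torch.stack(scores).mean(dim=0)
```

The resulting EPS representations of test and natural samples then feed the MMD-based discrepancy mentioned above.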
arXiv Detail & Related papers (2023-05-25T13:14:58Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
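For reference, the distance most commonly computed in this evaluation pipeline is the Fréchet distance between Gaussian fits of the two feature sets (the FID computation, up to the choice of representation space that the study analyzes); a standard sketch:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussian fits of two feature arrays
    (rows are instances, columns are feature dimensions)."""
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):      # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```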
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Active Sequential Two-Sample Testing [18.99517340397671]
We consider the two-sample testing problem in a new scenario where sample measurements are inexpensive to access.
We devise the first active two-sample testing framework, which queries not only sequentially but also actively.
In practice, we introduce an instantiation of our framework and evaluate it using several experiments.
arXiv Detail & Related papers (2023-01-30T02:23:49Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
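SUTA's full objective combines several terms; the sketch below shows only a core entropy-minimization step on one utterance, assuming a model that returns per-frame logits (a simplified reading, with all names illustrative):

```python
import torch
import torch.nn.functional as F

def adapt_single_utterance(model, utterance, lr=1e-4, n_steps=10):
    """Test-time adaptation of an ASR model to one utterance by
    minimizing the entropy of its frame-level output distribution."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(n_steps):
        logits = model(utterance)                 # (frames, vocab)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return model(utterance).argmax(dim=-1)        # adapted frame predictions
```

In practice, methods of this kind often restrict adaptation to a small parameter subset (e.g., normalization layers) to keep the single-utterance update stable.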
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- Robust Fairness-aware Learning Under Sample Selection Bias [17.09665420515772]
We propose a framework for robust and fair learning under sample selection bias.
We develop two algorithms to handle sample selection bias, covering the cases where test data is available and where it is unavailable.
arXiv Detail & Related papers (2021-05-24T23:23:36Z)
- Significance tests of feature relevance for a blackbox learner [6.72450543613463]
We derive two consistent tests for the feature relevance of a blackbox learner.
The first evaluates a loss difference with perturbation on an inference sample.
The second splits the inference sample into two but does not require data perturbation.
arXiv Detail & Related papers (2021-03-02T00:59:19Z)
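A hedged sketch of the first test's flavor: compare per-sample losses before and after perturbing one feature on the inference sample, then apply a paired test. The permutation-based perturbation and the t-test are illustrative choices, not the paper's exact construction.

```python
import numpy as np
from scipy import stats

def feature_relevance_test(predict, per_sample_loss, X, y, j, seed=0):
    """One-sided paired test: does perturbing feature j raise the loss?
    predict and per_sample_loss are user-supplied callables returning
    predictions and per-sample losses, respectively."""
    rng = np.random.default_rng(seed)
    base = per_sample_loss(y, predict(X))
    X_pert = X.copy()
    X_pert[:, j] = rng.permutation(X_pert[:, j])   # break feature j only
    pert = per_sample_loss(y, predict(X_pert))
    return stats.ttest_rel(pert, base, alternative="greater")
```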
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Preference Modeling with Context-Dependent Salient Features [12.403492796441434]
We consider the problem of estimating a ranking on a set of items from noisy pairwise comparisons given item features.
Our key observation is that two items compared in isolation from other items may be compared based on only a salient subset of features.
arXiv Detail & Related papers (2020-02-22T04:05:16Z)
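One plausible instantiation of the salient-feature observation, as a sketch: a BTL-style logistic comparison model in which only the k coordinates where the two items differ most contribute (w, k, and the hard top-k selection are assumptions for illustration):

```python
import numpy as np

def salient_preference_prob(x_i, x_j, w, k):
    """P(item i is preferred to item j) when only the k most
    contrasting features enter a logistic (BTL-style) comparison."""
    diff = x_i - x_j
    salient = np.argsort(-np.abs(diff))[:k]   # k largest |differences|
    utility = w[salient] @ diff[salient]      # salient-only utility gap
    return 1.0 / (1.0 + np.exp(-utility))
```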
This list is automatically generated from the titles and abstracts of the papers on this site.