Variations in Relevance Judgments and the Shelf Life of Test Collections
- URL: http://arxiv.org/abs/2502.20937v1
- Date: Fri, 28 Feb 2025 10:46:56 GMT
- Title: Variations in Relevance Judgments and the Shelf Life of Test Collections
- Authors: Andrew Parry, Maik Fröbe, Harrisen Scells, Ferdinand Schlatt, Guglielmo Faggioli, Saber Zerhoudi, Sean MacAvaney, Eugene Yang,
- Abstract summary: The paradigm shift towards neural retrieval models has affected the characteristics of modern test collections. We reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings. However, we observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers.
- Score: 50.060833338921945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The fundamental property of Cranfield-style evaluations, that system rankings are stable even when assessors disagree on individual relevance decisions, was validated on traditional test collections. However, the paradigm shift towards neural retrieval models affected the characteristics of modern test collections, e.g., documents are short, judged with four grades of relevance, and information needs have no descriptions or narratives. Under these changes, it is unclear whether assessor disagreement remains negligible for system comparisons. We investigate this aspect under the additional condition that the few modern test collections are heavily re-used. Given more possible query interpretations due to less formalized information needs, an "expiration date" for test collections might be needed if top-effectiveness requires overfitting to a single interpretation of relevance. We run a reproducibility study and re-annotate the relevance judgments of the 2019 TREC Deep Learning track. We can reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings. However, we observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers, providing evidence that test collections can expire.
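A minimal sketch of the kind of check the abstract describes: do two sets of relevance judgments (original vs. re-annotated) produce the same system ranking? Rank stability is measured here with Kendall's tau; the system names and nDCG values are hypothetical placeholders, not results from the paper, and in practice the scores would come from an evaluation tool such as trec_eval or ir_measures run against each qrels file.

```python
# Sketch: compare system rankings under two sets of relevance judgments.
from scipy.stats import kendalltau

# nDCG@10 per system under the original TREC DL 2019 judgments (hypothetical values)
original = {"bm25": 0.48, "monoT5": 0.71, "duoT5": 0.73, "colbert": 0.69}
# nDCG@10 per system under a re-annotated set of judgments (hypothetical values)
reannotated = {"bm25": 0.45, "monoT5": 0.62, "duoT5": 0.64, "colbert": 0.66}

systems = sorted(original)
tau, p_value = kendalltau(
    [original[s] for s in systems],
    [reannotated[s] for s in systems],
)
print(f"Kendall's tau between rankings: {tau:.2f} (p={p_value:.3f})")
# A tau close to 1 supports the classic Cranfield finding that assessor
# disagreement leaves system comparisons intact; a low tau suggests the
# collection's judgments no longer generalise across interpretations.
```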
Related papers
- Improving the Reusability of Conversational Search Test Collections [9.208308067952155]
Incomplete relevance judgments limit the reusability of test collections.
This is due to pockets of unjudged documents (called holes): documents returned by new systems that have no judgments in the test collection.
We employ Large Language Models (LLMs) to fill holes by leveraging existing judgments.
arXiv Detail & Related papers (2025-03-12T23:36:40Z)
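A minimal sketch, under assumed TREC-style data layouts, of how the holes described in the entry above can be located and filled: documents retrieved by a system but absent from the existing qrels. The function names and the `llm_judge` callable are hypothetical stand-ins, not the paper's actual method or API.

```python
# Sketch: locate "holes" (retrieved but unjudged documents) in a test collection.
# qrels: {query_id: {doc_id: grade}}, run: {query_id: [doc_id, ...]} (assumed layout).
def find_holes(run, qrels):
    """Return, per query, the retrieved documents that have no relevance judgment."""
    holes = {}
    for qid, ranked_docs in run.items():
        judged = qrels.get(qid, {})
        holes[qid] = [doc for doc in ranked_docs if doc not in judged]
    return holes

def fill_holes(holes, qrels, llm_judge):
    """Add predicted grades for unjudged documents; llm_judge is a hypothetical assessor."""
    for qid, docs in holes.items():
        for doc in docs:
            # llm_judge is assumed to return an integer relevance grade for (query, doc)
            qrels.setdefault(qid, {})[doc] = llm_judge(qid, doc)
    return qrels

# Toy example: query "q1" has judgments for d1/d2 only; a new system returns d3.
qrels = {"q1": {"d1": 2, "d2": 0}}
run = {"q1": ["d3", "d1", "d2"]}
print(find_holes(run, qrels))  # {'q1': ['d3']}
```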
- On the Statistical Significance with Relevance Assessments of Large Language Models [2.9180406633632523]
We use Large Language Models for labelling relevance of documents for building new retrieval test collections.
Our results show that LLM judgements detect most of the significant differences while maintaining acceptable numbers of false positives.
Our work represents a step forward in the evaluation of statistical testing results provided by LLM judgements.
arXiv Detail & Related papers (2024-11-20T11:19:35Z)
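A minimal sketch of the comparison the entry above implies: run a paired significance test over per-topic scores for two systems, once under human judgments and once under LLM judgments, and check whether the two verdicts agree. The paired t-test is one common choice (the paper may use other tests), and all per-topic scores are hypothetical.

```python
# Sketch: does a significance test reach the same verdict under human- and
# LLM-derived judgments? Per-topic scores below are hypothetical placeholders.
from scipy.stats import ttest_rel

def significant(scores_a, scores_b, alpha=0.05):
    """Paired t-test over per-topic scores of two systems."""
    stat, p = ttest_rel(scores_a, scores_b)
    return p < alpha, p

# Per-topic scores for systems A and B under human judgments (hypothetical)
human_a = [0.42, 0.55, 0.61, 0.38, 0.47, 0.52]
human_b = [0.35, 0.49, 0.50, 0.33, 0.41, 0.44]
# Per-topic scores for the same systems under LLM judgments (hypothetical)
llm_a = [0.45, 0.58, 0.59, 0.40, 0.50, 0.55]
llm_b = [0.37, 0.50, 0.52, 0.36, 0.40, 0.47]

print("human qrels:", significant(human_a, human_b))
print("LLM qrels:  ", significant(llm_a, llm_b))
# Agreement of these verdicts across many system pairs is what determines the
# true- and false-positive rates of LLM-based judgments.
```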
- Can We Use Large Language Models to Fill Relevance Judgment Holes? [9.208308067952155]
We take initial steps towards extending existing test collections by employing Large Language Models (LLM) to fill the holes.
We find substantially lower correlations when human and automatic judgments are combined.
arXiv Detail & Related papers (2024-05-09T07:39:19Z)
- No Agreement Without Loss: Learning and Social Choice in Peer Review [0.0]
It may be assumed that each reviewer has her own mapping from the set of features to a recommendation.
This introduces an element of arbitrariness known as commensuration bias.
Noothigattu, Shah and Procaccia proposed to aggregate the reviewers' mappings by minimizing certain loss functions.
arXiv Detail & Related papers (2022-11-03T21:03:23Z)
- Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z)
- Just Rank: Rethinking Evaluation with Word and Sentence Similarities [105.5541653811528]
Intrinsic evaluation for embeddings lags far behind, and there has been no significant update in the past decade.
This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations.
We propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks.
arXiv Detail & Related papers (2022-03-05T08:40:05Z)
- On Quantitative Evaluations of Counterfactuals [88.42660013773647]
This paper consolidates work on evaluating visual counterfactual examples through an analysis and experiments.
We find that while most metrics behave as intended for sufficiently simple datasets, some fail to tell the difference between good and bad counterfactuals when the complexity increases.
We propose two new metrics, the Label Variation Score and the Oracle score, which are both less vulnerable to such tiny changes.
arXiv Detail & Related papers (2021-10-30T05:00:36Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- A Sober Look at the Unsupervised Learning of Disentangled Representations and their Evaluation [63.042651834453544]
We show that the unsupervised learning of disentangled representations is impossible without inductive biases on both the models and the data.
We observe that while the different methods successfully enforce properties "encouraged" by the corresponding losses, well-disentangled models seemingly cannot be identified without supervision.
Our results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision.
arXiv Detail & Related papers (2020-10-27T10:17:15Z)
- On the Reliability of Test Collections for Evaluating Systems of Different Types [34.38281205776437]
Test collections are generated based on pooling results of various retrieval systems, but until recently this did not include deep learning systems.
This paper uses simulated pooling to test the fairness and reusability of test collections, showing that pooling based on traditional systems only can lead to biased evaluation of deep learning systems.
arXiv Detail & Related papers (2020-04-28T13:22:26Z)
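A minimal sketch of depth-k pooling, the mechanism the entry above simulates: the judgment pool is the union of the top-k documents from each contributing run, so systems that did not contribute to the pool can retrieve unjudged documents. The function name and run data are toy, hypothetical examples.

```python
# Sketch: depth-k pooling. runs: list of {query_id: [doc_id, ...]} (assumed layout).
def depth_k_pool(runs, k=2):
    """Build a per-query judgment pool from the top-k documents of each run."""
    pool = {}
    for run in runs:
        for qid, ranked_docs in run.items():
            pool.setdefault(qid, set()).update(ranked_docs[:k])
    return pool

# Two traditional runs contribute to the pool; a later neural run does not.
trad_run_1 = {"q1": ["d1", "d2", "d3"]}
trad_run_2 = {"q1": ["d2", "d4", "d5"]}
neural_run = {"q1": ["d7", "d1", "d8"]}  # d7 and d8 were never pooled

pool = depth_k_pool([trad_run_1, trad_run_2], k=2)
unjudged = [d for d in neural_run["q1"] if d not in pool["q1"]]
print(pool)      # {'q1': {'d1', 'd2', 'd4'}}
print(unjudged)  # ['d7', 'd8'] -> the potential bias against unpooled systems
```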