On the Reliability of Test Collections for Evaluating Systems of
Different Types
- URL: http://arxiv.org/abs/2004.13486v1
- Date: Tue, 28 Apr 2020 13:22:26 GMT
- Title: On the Reliability of Test Collections for Evaluating Systems of
Different Types
- Authors: Emine Yilmaz, Nick Craswell, Bhaskar Mitra and Daniel Campos
- Abstract summary: Test collections are generated based on pooling results of various retrieval systems, but until recently this did not include deep learning systems.
This paper uses simulated pooling to test the fairness and reusability of test collections, showing that pooling based only on traditional systems can lead to biased evaluation of deep learning systems.
- Score: 34.38281205776437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As deep learning-based models are increasingly being used for information
retrieval (IR), a major challenge is to ensure the availability of test
collections for measuring their quality. Test collections are generated based
on pooling the results of various retrieval systems, but until recently this did
not include deep learning systems. This raises a major challenge for reusable
evaluation: since deep learning-based models use external resources (e.g., word
embeddings) and advanced representations, as opposed to traditional methods that
are mainly based on lexical similarity, they may return different types of
relevant documents that were not identified in the original pooling. If so, test
collections constructed using traditional methods are likely to lead to biased
and unfair evaluation results for deep learning (neural) systems. This paper
uses simulated pooling to test the fairness and reusability of test
collections, showing that pooling based only on traditional systems can lead to
biased evaluation of deep learning systems.
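Concretely, the simulated-pooling methodology can be pictured as follows. This is a minimal sketch, not the authors' code: the function names, the depth-10 and P@10 defaults, and the run/qrels data shapes are illustrative assumptions. Each held-out (e.g., neural) run is scored once against the pool built from all runs and once against a pool built only from the remaining (traditional) runs; a systematic score drop under the reduced pool indicates the kind of bias the paper reports.

```python
from statistics import mean

def build_pool(runs, depth=10):
    """Union of the top-`depth` documents from each contributing run.
    `runs` maps system name -> {query_id: [doc_ids, ranked best-first]}."""
    pool = {}
    for ranking in runs.values():
        for qid, docs in ranking.items():
            pool.setdefault(qid, set()).update(docs[:depth])
    return pool

def precision_at_k(ranked, qrels, pool, k=10):
    """P@k that treats unpooled (hence unjudged) documents as non-relevant."""
    hits = sum(1 for d in ranked[:k] if d in pool and qrels.get(d, 0) > 0)
    return hits / k

def leave_out_scores(runs, qrels, held_out, depth=10, k=10):
    """Score each held-out system against (a) the full pool and (b) the pool
    built without the held-out runs. A large drop under (b) suggests the
    reduced pool is biased against the held-out system type."""
    reduced = build_pool({n: r for n, r in runs.items() if n not in held_out}, depth)
    full = build_pool(runs, depth)
    report = {}
    for name in held_out:
        ranking = runs[name]
        report[name] = (
            mean(precision_at_k(ranking[q], qrels.get(q, {}), full.get(q, set()), k)
                 for q in ranking),
            mean(precision_at_k(ranking[q], qrels.get(q, {}), reduced.get(q, set()), k)
                 for q in ranking),
        )
    return report
```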
Related papers
- Variations in Relevance Judgments and the Shelf Life of Test Collections [50.060833338921945]
The paradigm shift towards neural retrieval models has affected the characteristics of modern test collections.
We reproduce prior work in the neural retrieval setting, showing that assessor disagreement does not affect system rankings.
We observe that some models substantially degrade with our new relevance judgments, and some have already reached the effectiveness of humans as rankers.
arXiv Detail & Related papers (2025-02-28T10:46:56Z)
- GenTREC: The First Test Collection Generated by Large Language Models for Evaluating Information Retrieval Systems [0.33748750222488655]
GenTREC is the first test collection constructed entirely from documents generated by a Large Language Model (LLM).
We consider a document relevant only to the prompt that generated it, while all other document-topic pairs are treated as non-relevant (see the sketch after this list).
The resulting GenTREC collection comprises 96,196 documents, 300 topics, and 18,964 relevance "judgments".
arXiv Detail & Related papers (2025-01-05T00:27:36Z)
- Machine Learning for predicting chaotic systems [0.0]
We show that well-tuned simple methods, as well as untuned baseline methods, often outperform state-of-the-art deep learning models.
These findings underscore the importance of matching prediction methods to data characteristics and available computational resources.
arXiv Detail & Related papers (2024-07-29T16:34:47Z)
- Evaluating Generative Ad Hoc Information Retrieval [58.800799175084286]
Generative retrieval systems often return a grounded generated text directly as the response to a query.
Quantifying the utility of the textual responses is essential for appropriately evaluating such generative ad hoc retrieval.
arXiv Detail & Related papers (2023-11-08T14:05:00Z)
- Re-Benchmarking Pool-Based Active Learning for Binary Classification [27.034593234956713]
Active learning is a paradigm that significantly enhances the performance of machine learning models when acquiring labeled data is expensive.
While several benchmarks exist for evaluating active learning strategies, their findings exhibit some misalignment.
This discrepancy motivates us to develop a transparent and reproducible benchmark for the community.
arXiv Detail & Related papers (2023-06-15T08:47:50Z)
- A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [117.72709110877939]
Test-time adaptation (TTA) has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
We categorize TTA into several distinct groups based on the form of test data, namely, test-time domain adaptation, test-time batch adaptation, and online test-time adaptation.
arXiv Detail & Related papers (2023-03-27T16:32:21Z)
- The Integration of Machine Learning into Automated Test Generation: A Systematic Mapping Study [15.016047591601094]
We characterize emerging research, examining testing practices, researcher goals, ML techniques applied, evaluation, and challenges.
ML generates input for system, GUI, unit, and performance testing, or improves the performance of existing generation methods.
arXiv Detail & Related papers (2022-06-21T09:26:25Z)
- Knowledge-based Document Classification with Shannon Entropy [0.0]
We propose a novel knowledge-based model equipped with Shannon Entropy, which measures the richness of information and favors uniform and diverse keyword matches (see the sketch after this list).
We show that the Shannon Entropy significantly improves recall at a fixed false-positive rate.
arXiv Detail & Related papers (2022-06-06T05:39:10Z)
- General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model, analogous to gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z)
- An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z)
- Deep Keypoint-Based Camera Pose Estimation with Geometric Constraints [80.60538408386016]
Estimating relative camera poses from consecutive frames is a fundamental problem in visual odometry.
We propose an end-to-end trainable framework consisting of learnable modules for detection, feature extraction, matching and outlier rejection.
arXiv Detail & Related papers (2020-07-29T21:41:31Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kinds of metrics are more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
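As a concrete illustration of the GenTREC relevance rule referenced above, here is a minimal sketch under assumed data shapes (not the GenTREC code): relevance judgments fall out of the generation record itself, because each document is relevant only to the topic whose prompt produced it.

```python
def build_generated_qrels(generated):
    """Derive relevance judgments from the generation record.

    `generated` is a list of (topic_id, doc_id) pairs recording which
    prompt/topic produced each document. A document is judged relevant
    only to its generating topic; every other (topic, doc) pair is
    implicitly non-relevant, so no human assessment is required.
    """
    qrels = {}
    for topic_id, doc_id in generated:
        qrels.setdefault(topic_id, {})[doc_id] = 1
    return qrels

# Example: two topics, three generated documents.
qrels = build_generated_qrels([("t1", "d1"), ("t1", "d2"), ("t2", "d3")])
assert qrels == {"t1": {"d1": 1, "d2": 1}, "t2": {"d3": 1}}
```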
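For the Shannon Entropy entry referenced above, a minimal sketch of entropy over keyword matches; the paper's exact scoring may differ, but the general idea is that uniform, diverse keyword hits yield high entropy while repeated hits on a single keyword yield entropy near zero.

```python
import math
from collections import Counter

def match_entropy(doc_tokens, keywords):
    """Shannon entropy of the keyword-match distribution in a document."""
    counts = Counter(t for t in doc_tokens if t in keywords)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

keywords = {"neural", "retrieval", "pooling"}
# Diverse matches across keywords -> higher entropy (1.5 bits here).
print(match_entropy("neural retrieval uses pooling and neural nets".split(), keywords))
# Repeated hits on one keyword -> entropy 0.0.
print(match_entropy("pooling pooling pooling".split(), keywords))
```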
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.