BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information
Retrieval Models
- URL: http://arxiv.org/abs/2104.08663v1
- Date: Sat, 17 Apr 2021 23:29:55 GMT
- Title: BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information
Retrieval Models
- Authors: Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava,
Iryna Gurevych
- Abstract summary: We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
- Score: 41.45240621979654
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Neural IR models have often been studied in homogeneous and narrow settings,
which has considerably limited insights into their generalization capabilities.
To address this, and to allow researchers to more broadly establish the
effectiveness of their models, we introduce BEIR (Benchmarking IR), a
heterogeneous benchmark for information retrieval. We leverage a careful
selection of 17 datasets for evaluation spanning diverse retrieval tasks
including open-domain datasets as well as narrow expert domains. We study the
effectiveness of nine state-of-the-art retrieval models in a zero-shot
evaluation setup on BEIR, finding that performing well consistently across all
datasets is challenging. Our results show BM25 is a robust baseline and
reranking-based models overall achieve the best zero-shot performance, albeit
at high computational cost. In contrast, dense-retrieval models are
computationally more efficient but often underperform other approaches,
highlighting the considerable room for improvement in their generalization
capabilities. In this work, we extensively analyze different retrieval models
and provide several suggestions that we believe may be useful for future work.
BEIR datasets and code are available at https://github.com/UKPLab/beir.
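The repository above also ships a Python package that bundles dataset download, loading, retrieval, and evaluation. The following is a minimal sketch of a zero-shot dense-retrieval run in the spirit of the repository's quickstart; the dataset URL, model name, and module paths are illustrative and may differ across beir versions, so treat it as an outline rather than a verbatim excerpt.
```python
# Minimal sketch of a zero-shot evaluation with the beir package
# (https://github.com/UKPLab/beir). Dataset URL, model name, and module
# paths follow the quickstart but may differ across versions.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load one BEIR dataset (SciFact is among the smallest).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Zero-shot dense retrieval: a Sentence-BERT model trained on MS MARCO,
# applied to SciFact without any in-domain fine-tuning.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# Score against the relevance judgments at the default cutoffs (1, 3, 5, 10, 100, 1000).
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```
Since nDCG@10 is the paper's headline metric, the `NDCG@10` entry of the returned dictionary should be the number most directly comparable to the reported results.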
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs from them.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- On Evaluation of Vision Datasets and Models using Human Competency Frameworks [20.802372291783488]
Item Response Theory (IRT) is a framework that infers interpretable latent parameters for an ensemble of models and each dataset item; a standard IRT formulation is sketched after this list.
We assess model calibration, select informative data subsets, and demonstrate the usefulness of its latent parameters for analyzing and comparing models and datasets in computer vision.
arXiv Detail & Related papers (2024-09-06T06:20:11Z)
- Multi-document Summarization: A Comparative Evaluation [0.0]
This paper evaluates state-of-the-art models for Multi-document Summarization (MDS) on different types of datasets in various domains.
We analyzed the performance of the PRIMERA and PEG models on the Big-Survey and MS² datasets.
arXiv Detail & Related papers (2023-09-10T07:43:42Z)
- Performance of different machine learning methods on activity recognition and pose estimation datasets [0.0]
This paper employs both classical and ensemble approaches on rich pose estimation (OpenPose) and HAR datasets.
The results show that overall, random forest yields the highest accuracy in classifying ADLs.
Overall, the models perform well across both datasets, except for logistic regression and AdaBoost, which perform poorly on the HAR dataset.
arXiv Detail & Related papers (2022-10-19T02:07:43Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- A Thorough Examination on Zero-shot Dense Retrieval [84.70868940598143]
We present the first thorough examination of the zero-shot capability of dense retrieval (DR) models.
We discuss the effect of several key factors related to source training set, analyze the potential bias from the target dataset, and review and compare existing zero-shot DR models.
arXiv Detail & Related papers (2022-04-27T07:59:07Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation across several datasets, using a variety of state-of-the-art benchmarks, demonstrates that our approach achieves significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Rethinking Evaluation in ASR: Are Our Models Robust Enough? [30.114009549372923]
We show that, in general, reverberative and additive noise augmentation improves generalization performance across domains.
We demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world noisy data.
arXiv Detail & Related papers (2020-10-22T14:01:32Z)
- Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)
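Two entries in the list above ("On Evaluation of Vision Datasets and Models using Human Competency Frameworks" and "Comparing Test Sets with Item Response Theory") lean on Item Response Theory. Purely as a point of reference, and without claiming this is either paper's exact parameterization, the sketch below shows the standard two-parameter logistic (2PL) IRT model, in which a latent ability per model and a difficulty and discrimination per item determine the probability of a correct response.
```python
# Standard two-parameter logistic (2PL) IRT model, shown for reference only;
# the papers above may use a different parameterization or fitting procedure.
import numpy as np

def irt_2pl(theta: float, difficulty: float, discrimination: float) -> float:
    """Probability that a model with latent ability `theta` answers an item
    with the given difficulty and discrimination correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

# A strong model (theta = 2.0) on a hard, highly discriminative item.
print(irt_2pl(theta=2.0, difficulty=1.5, discrimination=2.0))  # ~0.73
```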
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.