BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information
Retrieval Models
- URL: http://arxiv.org/abs/2104.08663v1
- Date: Sat, 17 Apr 2021 23:29:55 GMT
- Title: BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information
Retrieval Models
- Authors: Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava,
Iryna Gurevych
- Abstract summary: We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
- Score: 41.45240621979654
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Neural IR models have often been studied in homogeneous and narrow settings,
which has considerably limited insights into their generalization capabilities.
To address this, and to allow researchers to more broadly establish the
effectiveness of their models, we introduce BEIR (Benchmarking IR), a
heterogeneous benchmark for information retrieval. We leverage a careful
selection of 17 datasets for evaluation spanning diverse retrieval tasks
including open-domain datasets as well as narrow expert domains. We study the
effectiveness of nine state-of-the-art retrieval models in a zero-shot
evaluation setup on BEIR, finding that performing well consistently across all
datasets is challenging. Our results show BM25 is a robust baseline and
reranking-based models overall achieve the best zero-shot performance, albeit
at high computational cost. In contrast, dense-retrieval models are
computationally more efficient but often underperform other approaches,
highlighting the considerable room for improvement in their generalization
capabilities. In this work, we extensively analyze different retrieval models
and provide several suggestions that we believe may be useful for future work.
BEIR datasets and code are available at https://github.com/UKPLab/beir.
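The repository above also ships a Python package that bundles dataset download, loading, retrieval, and evaluation. The following is a minimal sketch of a zero-shot dense-retrieval run in the spirit of the repository's quickstart; the dataset URL, model name, and module paths are illustrative and may differ across beir versions, so treat it as an outline rather than a verbatim excerpt.
```python
# Minimal sketch of a zero-shot evaluation with the beir package
# (https://github.com/UKPLab/beir). Dataset URL, model name, and module
# paths follow the quickstart but may differ across versions.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and load one BEIR dataset (SciFact is among the smallest).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Zero-shot dense retrieval: a Sentence-BERT model trained on MS MARCO,
# applied to SciFact without any in-domain fine-tuning.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# Score against the relevance judgments at the default cutoffs (1, 3, 5, 10, 100, 1000).
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)
```
Since nDCG@10 is the paper's headline metric, the `NDCG@10` entry of the returned dictionary should be the number most directly comparable to the reported results.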
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs from them.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- On Evaluation of Vision Datasets and Models using Human Competency Frameworks [20.802372291783488]
Item Response Theory (IRT) is a framework that infers interpretable latent parameters for an ensemble of models and each dataset item; a standard IRT formulation is sketched after this list.
We assess model calibration, select informative data subsets, and demonstrate the usefulness of its latent parameters for analyzing and comparing models and datasets in computer vision.
arXiv Detail & Related papers (2024-09-06T06:20:11Z)
- Multi-document Summarization: A Comparative Evaluation [0.0]
This paper evaluates state-of-the-art models for Multi-document Summarization (MDS) on different types of datasets in various domains.
We analyzed the performance of the PRIMERA and PEG models on the Big-Survey and MS² datasets.
arXiv Detail & Related papers (2023-09-10T07:43:42Z)
- Performance of different machine learning methods on activity recognition and pose estimation datasets [0.0]
This paper employs both classical and ensemble approaches on rich pose estimation (OpenPose) and HAR datasets.
The results show that overall, random forest yields the highest accuracy in classifying ADLs.
Overall, the models perform well across both datasets, except for logistic regression and AdaBoost, which perform poorly on the HAR dataset.
arXiv Detail & Related papers (2022-10-19T02:07:43Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- A Thorough Examination on Zero-shot Dense Retrieval [84.70868940598143]
We present the first thorough examination of the zero-shot capability of dense retrieval (DR) models.
We discuss the effect of several key factors related to source training set, analyze the potential bias from the target dataset, and review and compare existing zero-shot DR models.
arXiv Detail & Related papers (2022-04-27T07:59:07Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation across several datasets, using a variety of state-of-the-art benchmarks, demonstrates that our approach achieves significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Rethinking Evaluation in ASR: Are Our Models Robust Enough? [30.114009549372923]
We show that, in general, reverberative and additive noise augmentation improves generalization performance across domains.
We demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world noisy data.
arXiv Detail & Related papers (2020-10-22T14:01:32Z)
- Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study [81.11161697133095]
We take the NER task as a testbed to analyze the generalization behavior of existing models from different perspectives.
Experiments with in-depth analyses diagnose the bottleneck of existing neural NER models.
As a by-product of this paper, we have open-sourced a project that involves a comprehensive summary of recent NER papers.
arXiv Detail & Related papers (2020-01-12T04:33:53Z)
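Two entries in the list above ("On Evaluation of Vision Datasets and Models using Human Competency Frameworks" and "Comparing Test Sets with Item Response Theory") lean on Item Response Theory. Purely as a point of reference, and without claiming this is either paper's exact parameterization, the sketch below shows the standard two-parameter logistic (2PL) IRT model, in which a latent ability per model and a difficulty and discrimination per item determine the probability of a correct response.
```python
# Standard two-parameter logistic (2PL) IRT model, shown for reference only;
# the papers above may use a different parameterization or fitting procedure.
import numpy as np

def irt_2pl(theta: float, difficulty: float, discrimination: float) -> float:
    """Probability that a model with latent ability `theta` answers an item
    with the given difficulty and discrimination correctly."""
    return 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))

# A strong model (theta = 2.0) on a hard, highly discriminative item.
print(irt_2pl(theta=2.0, difficulty=1.5, discrimination=2.0))  # ~0.73
```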
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.