BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language
- URL: http://arxiv.org/abs/2412.08329v1
- Date: Wed, 11 Dec 2024 12:15:57 GMT
- Authors: Nikolay Banar, Ehsan Lotfi, Walter Daelemans
- Abstract summary: We introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch.
We evaluate a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method.
- Abstract: Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content reduces its utility for underrepresented languages in IR, including Dutch. To address this limitation and encourage the development of Dutch IR models, we introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method. Our experiments show that BM25 remains a competitive baseline and is outperformed only by larger dense models trained for retrieval. When combined with reranking models, BM25 achieves performance on par with the best dense ranking models. In addition, we explored the impact of translation on the data by back-translating a selection of datasets to English, and observed a performance drop for both dense and lexical methods, indicating the limitations of translation for creating benchmarks. BEIR-NL is publicly available on the Hugging Face Hub.
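The evaluation setup the abstract describes (zero-shot retrieval over corpora, queries, and relevance judgments in BEIR format) follows the standard BEIR protocol. As a rough illustration, here is a minimal sketch using the `beir` package; the local folder name, dataset choice, and encoder checkpoint are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch: zero-shot evaluation on one BEIR-NL dataset with the
# `beir` package. Folder name and encoder are illustrative assumptions.
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Assumes a BEIR-NL dataset fetched from the Hugging Face Hub and laid
# out in BEIR format: corpus.jsonl, queries.jsonl, qrels/test.tsv.
corpus, queries, qrels = GenericDataLoader("datasets/beir-nl-scifact").load(split="test")

# Any multilingual sentence encoder can be slotted in here.
model = DRES(models.SentenceBERT("sentence-transformers/paraphrase-multilingual-mpnet-base-v2"),
             batch_size=64)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)  # {query_id: {doc_id: score}}
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # e.g. NDCG@10, the headline BEIR metric
```

BM25 baselines and cross-encoder rerankers are evaluated through the same `EvaluateRetrieval` interface, which is how comparisons like the BM25-plus-reranking result above are typically run.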
Related papers
- Bilingual BSARD: Extending Statutory Article Retrieval to Dutch
This dataset contains parallel Belgian statutory articles in both French and Dutch, along with legal questions from BSARD and their Dutch translations.
We conduct extensive benchmarking of retrieval models available for Dutch and French.
Our experiments show that BM25 remains a competitive baseline compared to many zero-shot dense models in both languages (a toy BM25 sketch follows this entry).
arXiv Detail & Related papers (2024-12-10T12:31:33Z)
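Both the BEIR-NL abstract and the Bilingual BSARD entry above report BM25 as a competitive lexical baseline. Below is a toy sketch with the `rank_bm25` package; the two-document Dutch corpus and the whitespace tokenizer are placeholders (real benchmarks use a proper analyzer with stemming and stopword handling).

```python
# Toy BM25 ranking with the `rank_bm25` package; corpus and tokenizer
# are placeholders for illustration only.
from rank_bm25 import BM25Okapi

corpus = [
    "De rechter verklaart het beroep ontvankelijk.",
    "Het wetboek regelt de bevoegdheid van de rechtbank.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "bevoegdheid van de rechtbank".split()
print(bm25.get_scores(query))  # one BM25 score per document; higher is better
```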
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the large pool of existing ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" on speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z)
- BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish.
We introduced BEIR-PL, a new benchmark comprising 13 datasets.
We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark.
arXiv Detail & Related papers (2023-05-31T13:29:07Z)
- DUMB: A Benchmark for Smart Evaluation of Dutch Models
We introduce the Dutch Model Benchmark: DUMB. The benchmark includes a diverse set of datasets for low-, medium- and high-resource tasks.
Relative Error Reduction (RER) compares the DUMB performance of language models to a strong baseline (one possible reading of RER is sketched after this entry).
The highest performance is achieved by DeBERTaV3 (large), XLM-R (large), and mDeBERTaV3 (base).
arXiv Detail & Related papers (2023-05-22T13:27:37Z)
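The RER metric in the DUMB entry is only named above; one natural reading is sketched below. Treating error as one minus the task score is an assumption here, and DUMB's precise definition (choice of baseline, aggregation across tasks) may differ.

```python
# Hedged sketch of Relative Error Reduction (RER): the share of the
# baseline's error that a model removes. Assumes error = 1 - score;
# the benchmark's exact definition may differ.
def relative_error_reduction(model_score: float, baseline_score: float) -> float:
    baseline_error = 1.0 - baseline_score
    model_error = 1.0 - model_score
    return (baseline_error - model_error) / baseline_error

# A model scoring 0.92 against a 0.90 baseline removes ~20% of its error.
print(relative_error_reduction(0.92, 0.90))  # ~0.2
```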
- Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data
We show that the effectiveness of zero-shot rankers diminishes when queries and documents are in different languages.
Motivated by this, we propose to train ranking models on artificially code-switched data instead (a data-construction sketch follows this entry).
arXiv Detail & Related papers (2023-05-09T09:32:19Z)
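The entry above trains rankers on artificially code-switched text. One simple way to construct such data is to substitute words through a bilingual lexicon, sketched below; the tiny Dutch-English dictionary, the substitution probability, and the word-level rule are illustrative assumptions rather than the paper's actual procedure.

```python
# Sketch: build code-switched queries by swapping words via a bilingual
# lexicon. Dictionary, probability, and word-level rule are assumptions.
import random

NL_EN = {"rechtbank": "court", "vraag": "question", "wet": "law"}

def code_switch(text: str, lexicon: dict, p: float = 0.5, seed: int = 1) -> str:
    rng = random.Random(seed)
    tokens = [
        lexicon.get(tok, tok) if rng.random() < p else tok
        for tok in text.split()
    ]
    return " ".join(tokens)

print(code_switch("welke wet regelt de rechtbank", NL_EN))
# tokens found in the lexicon may be swapped, e.g. "rechtbank" -> "court",
# and the mixed-language string is used as a training query
```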
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks (a translate-test sketch follows this entry).
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
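The IGLUE finding above contrasts translate-test transfer (machine-translate the test data, then apply a source-language model) with zero-shot transfer. A text-only sketch of translate-test follows; IGLUE itself is multimodal, and both checkpoints here are common public models chosen purely for illustration.

```python
# Sketch of translate-test inference: translate Dutch input to English,
# then run an English-only model. Checkpoints are illustrative.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-nl-en")
classifier = pipeline("sentiment-analysis")  # stand-in English task model

nl_input = "Deze film was verrassend goed."
en_input = translator(nl_input)[0]["translation_text"]
print(classifier(en_input))  # the prediction is made on the translated input
```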
- On Cross-Lingual Retrieval with Multilingual Text Encoders
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments (a minimal unsupervised CLIR sketch follows this entry).
We also evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., trained to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
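The unsupervised ad-hoc CLIR setup above reduces to embedding queries and documents from different languages in a shared space and ranking by similarity. A minimal sketch with `sentence-transformers` follows; the checkpoint and the toy English-query, Dutch-document example are assumptions for illustration.

```python
# Sketch of unsupervised CLIR: rank Dutch documents for an English query
# by cosine similarity in a multilingual embedding space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query = "side effects of aspirin"
docs = [
    "Aspirine kan maagklachten en bloedingen veroorzaken.",
    "De trein naar Antwerpen vertrekt om negen uur.",
]
q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(q_emb, d_emb)[0]  # one similarity per document
print(scores.argsort(descending=True))  # document indices, best first
```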
- Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Mr. TyDi is a benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages.
The goal of this resource is to spur research in dense retrieval techniques in non-English languages.
arXiv Detail & Related papers (2021-08-19T16:53:43Z)
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
arXiv Detail & Related papers (2021-04-17T23:29:55Z)