BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
- URL: http://arxiv.org/abs/2305.19840v2
- Date: Thu, 16 May 2024 10:59:27 GMT
- Title: BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
- Authors: Konrad Wojtasik, Vadim Shishkin, Kacper Wołowiec, Arkadiusz Janz, Maciej Piasecki
- Abstract summary: In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish.
We introduced BEIR-PL, a new benchmark comprising 13 datasets.
We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark.
- Score: 4.720913027054481
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR) in zero-shot settings, garnering considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to the English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing the research in this NLP area. In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish, and we introduced the BEIR-PL benchmark -- a new benchmark which comprises 13 datasets, facilitating further development, training and evaluation of modern Polish language models for IR tasks. We executed an evaluation and comparison of numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. Additionally, the evaluation revealed that BM25 achieved significantly lower scores for Polish than for English, which can be attributed to the high inflection and intricate morphological structure of the Polish language. Finally, we trained various re-ranking models to enhance the BM25 retrieval, and we compared their performance to identify their unique characteristic features. To ensure accurate model comparisons, it is necessary to scrutinise individual results rather than to average across the entire benchmark. Thus, we thoroughly analysed the outcomes of IR models in relation to each individual data subset encompassed by the BEIR benchmark. The benchmark data is available at https://huggingface.co/clarin-knext.
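The abstract's finding about inflection is easy to reproduce with a plain lexical retriever. The minimal sketch below is not part of the BEIR-PL release; it uses the third-party `rank_bm25` package on a toy corpus to show BM25 failing to match an inflected Polish query form against other surface forms of the same lemma, which is exactly the weakness the trained re-rankers are meant to compensate for.

```python
# pip install rank_bm25
from rank_bm25 import BM25Okapi

# Toy Polish corpus: "kot" (cat) appears in three different inflected forms.
corpus = [
    "Kot śpi na kanapie.",         # nominative: kot
    "Nie widzę kota w ogrodzie.",  # genitive: kota
    "Dałem kotu miskę mleka.",     # dative: kotu
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

# A query in the genitive form matches only the surface form "kota";
# without lemmatization, BM25 scores the other documents at zero.
query = "kota".split()
print(bm25.get_scores(query))  # positive score only for the second document
```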
Related papers
- Enabling Low-Resource Language Retrieval: Establishing Baselines for Urdu MS MARCO [0.6554326244334868]
This paper introduces the first large-scale Urdu IR dataset, created by translating the MS MARCO dataset through machine translation.
We establish baseline results through zero-shot learning for IR in Urdu and subsequently apply the mMARCO multilingual IR methodology to this newly translated dataset.
Our findings demonstrate that the fine-tuned model (Urdu-mT5-mMARCO) achieves a Mean Reciprocal Rank (MRR@10) of 0.247 and a Recall@10 of 0.439, representing significant improvements over zero-shot results.
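For reference, MRR@10 is the mean over queries of the reciprocal rank of the first relevant document, counting only the top 10 results. A minimal illustration (the helper below is ours, not code from the paper):

```python
def mrr_at_10(rankings):
    """Mean Reciprocal Rank cut off at rank 10.

    `rankings` maps each query id to a boolean list over its ranked
    results, True where the document is relevant. Queries whose first
    relevant document falls outside the top 10 contribute 0.
    """
    total = 0.0
    for relevances in rankings.values():
        for rank, is_relevant in enumerate(relevances[:10], start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# A first hit at rank 4 yields 1/4 = 0.25, close to the reported 0.247.
print(mrr_at_10({"q1": [False, False, False, True]}))  # 0.25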
arXiv Detail & Related papers (2024-12-17T15:21:28Z)
- BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language [3.3990813930813997]
We introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch.
We evaluate a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method.
arXiv Detail & Related papers (2024-12-11T12:15:57Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods [0.552480439325792]
We present Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish.
The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics.
We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including the baseline models trained by us.
arXiv Detail & Related papers (2024-02-20T19:53:36Z)
- ExaRanker-Open: Synthetic Explanation for IR using Open-Source LLMs [60.81649785463651]
We introduce ExaRanker-Open, where we adapt and explore the use of open-source language models to generate explanations.
Our findings reveal that incorporating explanations consistently enhances neural rankers, with benefits escalating as the LLM size increases.
arXiv Detail & Related papers (2024-02-09T11:23:14Z)
- Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning.
Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within the data through their natural language counterparts, thus achieving disentanglement.
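As a rough illustration of the vocabulary-space idea (a hypothetical stand-in, not the actual VDR architecture): both inputs are encoded as non-negative weights over the vocabulary, so each dimension of a representation is a nameable word and relevance is a dot product in that space.

```python
import torch
import torch.nn as nn

class VocabSpaceEncoder(nn.Module):
    """Maps token ids to non-negative weights over the vocabulary,
    so every dimension of the representation is a nameable word."""
    def __init__(self, vocab_size: int = 30522, hidden: int = 256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden, mode="mean")
        self.to_vocab = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        pooled = self.embed(token_ids, offsets)
        return torch.relu(self.to_vocab(pooled))  # sparse-ish, interpretable

encoder = VocabSpaceEncoder()
data_ids = torch.tensor([1, 5, 9])    # e.g. a passage (token ids are dummies)
text_ids = torch.tensor([5, 9, 12])   # its natural-language counterpart
offsets = torch.tensor([0])
score = (encoder(data_ids, offsets) * encoder(text_ids, offsets)).sum()
```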
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
- This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z)
- Evaluation of Transfer Learning for Polish with a Text-to-Text Model [54.81823151748415]
We introduce a new benchmark for assessing the quality of text-to-text models for Polish.
The benchmark consists of diverse tasks and datasets: KLEJ benchmark adapted for text-to-text, en-pl translation, summarization, and question answering.
We present plT5 - a general-purpose text-to-text model for Polish that can be fine-tuned on various Natural Language Processing (NLP) tasks with a single training objective.
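A minimal usage sketch, assuming the checkpoint is published under the `allegro/plt5-base` id on the Hugging Face Hub (verify the id before use). It only shows the single text-to-text interface; the pretrained model still needs task-specific fine-tuning to produce useful outputs.

```python
# pip install transformers sentencepiece
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

name = "allegro/plt5-base"  # assumed model id; check the Hub
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# Every task uses the same interface: text in, text out.
inputs = tokenizer("Zero-shot retrieval is hard.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```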
arXiv Detail & Related papers (2022-05-18T09:17:14Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
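The paper's exact discounting formula is not reproduced here; the sketch below only conveys the idea under an assumed form, scaling each IoU hit by how far the predicted boundaries drift from the ground truth so that trivially biased predictions score less.

```python
def dr_at_1(preds, gts, durations, iou_m=0.5):
    """Hypothetical sketch of a discounted recall ('dR@1,IoU@m') style metric.

    Assumption: each top-1 hit (IoU >= m) is down-weighted by the
    normalized start/end offsets from the ground truth; the paper's
    exact formula may differ. Moments are (start, end) in seconds.
    """
    total = 0.0
    for (ps, pe), (gs, ge), dur in zip(preds, gts, durations):
        inter = max(0.0, min(pe, ge) - max(ps, gs))
        union = max(pe, ge) - min(ps, gs)
        if union > 0 and inter / union >= iou_m:
            total += (1 - abs(ps - gs) / dur) * (1 - abs(pe - ge) / dur)
    return total / len(preds)
```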
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models [41.45240621979654]
We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
arXiv Detail & Related papers (2021-04-17T23:29:55Z)
- Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)