SuiteEval: Simplifying Retrieval Benchmarks
- URL: http://arxiv.org/abs/2602.18107v1
- Date: Fri, 20 Feb 2026 09:54:16 GMT
- Title: SuiteEval: Simplifying Retrieval Benchmarks
- Authors: Andrew Parry, Debasis Ganguly, Sean MacAvaney
- Abstract summary: SuiteEval is a unified framework that offers automatic end-to-end evaluation. It handles data loading, indexing, ranking, metric computation, and result aggregation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Information retrieval evaluation often suffers from fragmented practices -- varying dataset subsets, aggregation methods, and pipeline configurations -- that undermine reproducibility and comparability, especially for foundation embedding models requiring robust out-of-domain performance. We introduce SuiteEval, a unified framework that offers automatic end-to-end evaluation, dynamic indexing that reuses on-disk indices to minimise disk usage, and built-in support for major benchmarks (BEIR, LoTTE, MS MARCO, NanoBEIR, and BRIGHT). Users only need to supply a pipeline generator; SuiteEval handles data loading, indexing, ranking, metric computation, and result aggregation. New benchmark suites can be added in a single line. SuiteEval reduces boilerplate and standardises evaluations to facilitate reproducible IR research at a time when evaluation over a broader set of benchmarks is increasingly required.
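The workflow the abstract describes (the user supplies only a pipeline generator; the framework loads data, ranks, computes metrics, and aggregates) can be sketched in plain Python. This is an illustrative toy, not SuiteEval's actual API: the suite, corpus, pipeline, and metric below are all hypothetical stand-ins.

```python
from statistics import mean

# Toy stand-ins for real benchmark suites (BEIR, LoTTE, ...): each dataset
# provides queries and relevance judgments (qrels).
SUITE = {
    "toy-a": {"queries": {"q1": "apple"}, "qrels": {"q1": {"d1"}}},
    "toy-b": {"queries": {"q1": "banana"}, "qrels": {"q1": {"d2"}}},
}
CORPUS = {"d1": "apple pie", "d2": "banana bread"}

def pipeline_generator():
    """User-supplied part: returns a ranking function (query -> ranked doc ids)."""
    def rank(query):
        terms = set(query.split())
        # trivial term-overlap ranker standing in for a real retrieval pipeline
        return sorted(CORPUS, key=lambda d: -len(terms & set(CORPUS[d].split())))
    return rank

def precision_at_1(ranked, relevant):
    return 1.0 if ranked and ranked[0] in relevant else 0.0

def evaluate(suite, make_pipeline):
    """Framework side: load data, run the pipeline, score, and aggregate."""
    pipeline = make_pipeline()
    results = {}
    for name, data in suite.items():
        results[name] = mean(
            precision_at_1(pipeline(query), data["qrels"][qid])
            for qid, query in data["queries"].items()
        )
    results["mean"] = mean(results.values())
    return results

results = evaluate(SUITE, pipeline_generator)
```

The design point the abstract emphasises is the division of labour: everything except `pipeline_generator` lives on the framework side, so swapping in a new suite or metric does not touch user code.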
Related papers
- IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation [85.56193980646981]
We propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses. Experiments on IF-RewardBench reveal significant deficiencies in current judge models.
arXiv Detail & Related papers (2026-03-05T02:21:17Z) - DEP: A Decentralized Large Language Model Evaluation Protocol [51.3646001384887]
Decentralized Evaluation Protocol (DEP) is a decentralized yet unified and standardized evaluation framework. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation. We develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control.
arXiv Detail & Related papers (2026-03-01T16:10:16Z) - Easy Data Unlearning Bench [53.1304932656586]
We introduce a unified benchmarking suite that simplifies the evaluation of unlearning algorithms. By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods.
arXiv Detail & Related papers (2026-02-18T12:20:32Z) - Less is more: Not all samples are effective for evaluation [1.6456338609651404]
Existing compression methods depend on correctness labels from multiple historical models evaluated on the full test set. We propose a history-free test set compression framework that requires no prior model performance data. Our approach effectively identifies and removes redundant samples, reducing evaluation cost by over 90%.
arXiv Detail & Related papers (2025-12-22T08:04:05Z) - Structured Prompting Enables More Robust Evaluation of Language Models [38.53918044830268]
We present a DSPy+HELM framework that introduces structured prompting methods which elicit reasoning. We find that without structured prompting, HELM underestimates LM performance (by 4% average) and performance estimates vary more across benchmarks. This is the first benchmarking study to systematically integrate structured prompting into an established evaluation framework.
arXiv Detail & Related papers (2025-11-25T20:37:59Z) - Bencher: Simple and Reproducible Benchmarking for Black-Box Optimization [5.703483582960509]
Bencher is a modular benchmarking framework for black-box optimization. Each benchmark is isolated in its own virtual Python environment and accessed via a unified, version-agnostic remote procedure call (RPC) interface. Bencher can be deployed locally or remotely via Docker or on high-performance computing clusters via Singularity.
arXiv Detail & Related papers (2025-05-27T15:18:58Z) - VIBE: Vector Index Benchmark for Embeddings [5.449089394751681]
We introduce Vector Index Benchmark for Embeddings (VIBE), an open source project for benchmarking ANN algorithms. VIBE contains a pipeline for creating benchmark datasets using dense embedding models characteristic of modern applications. We use VIBE to conduct a comprehensive evaluation of SOTA vector indexes, benchmarking 21 implementations on 12 in-distribution and 6 out-of-distribution datasets.
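The headline number such ANN benchmarks report is recall@k: how many of the true nearest neighbors an approximate index recovers versus exact brute-force search. A minimal, self-contained illustration of that metric (toy 2-D points and a deliberately crippled "approximate" search standing in for a real index):

```python
import math

def exact_knn(query, points, k):
    """Brute-force ground truth: indices of the k nearest points by L2 distance."""
    return sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors the approximate search recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0)]
query = (0.4, 0.4)

exact = exact_knn(query, points, k=2)  # -> [0, 1]

# "Approximate" search that only probes a subset of the points, the way a
# real index trades exhaustiveness for speed.
probed = [0, 2, 3]
approx = sorted(probed, key=lambda i: math.dist(query, points[i]))[:2]  # -> [0, 2]

recall = recall_at_k(approx, exact)  # -> 0.5
```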
arXiv Detail & Related papers (2025-05-23T12:28:10Z) - Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards [5.632231145349045]
This paper investigates transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
arXiv Detail & Related papers (2024-11-07T22:36:19Z) - Semi-Parametric Retrieval via Binary Bag-of-Tokens Index [71.78109794895065]
SemI-parametric Disentangled Retrieval (SiDR) is a bi-encoder retrieval framework that decouples the retrieval index from neural parameters. SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness.
arXiv Detail & Related papers (2024-05-03T08:34:13Z) - Evaluating Retrieval Quality in Retrieval-Augmented Generation [21.115495457454365]
Traditional end-to-end evaluation methods are computationally expensive.
We propose eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system.
eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
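The idea summarized above (scoring each retrieved document individually by its downstream usefulness, rather than running the whole retrieved list through one expensive end-to-end evaluation) can be sketched as follows. The `llm_answer` stand-in is a trivial keyword lookup, not a real model, and all names are illustrative only.

```python
def llm_answer(question, document):
    """Stand-in for an LLM answering from a single document."""
    return "yes" if "paris" in document.lower() else "unknown"

def per_document_scores(question, retrieved, gold_answer):
    """eRAG-style idea: each document's individual downstream correctness
    doubles as a relevance label for evaluating the retriever."""
    return [1.0 if llm_answer(question, doc) == gold_answer else 0.0
            for doc in retrieved]

retrieved = ["Paris is the capital of France.", "Berlin is in Germany."]
scores = per_document_scores("What is the capital of France?", retrieved, "yes")
# scores -> [1.0, 0.0]: the first document alone suffices to answer correctly
```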
arXiv Detail & Related papers (2024-04-21T21:22:28Z) - How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark [60.72725673114168]
We revisit the question of accurate BERT-pruning during fine-tuning on downstream datasets.
We propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark.
arXiv Detail & Related papers (2023-12-21T03:11:30Z) - SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval [92.27387459751309]
We provide SPRINT, a unified Python toolkit for evaluating neural sparse retrieval.
We establish strong and reproducible zero-shot sparse retrieval baselines on the widely used BEIR benchmark.
We show that SPLADEv2 produces sparse representations with a majority of tokens outside of the original query and document.
arXiv Detail & Related papers (2023-07-19T22:48:02Z) - Dynabench: Rethinking Benchmarking in NLP [82.26699038776812]
We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking.
Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation.
We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform.
arXiv Detail & Related papers (2021-04-07T17:49:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences of its use.