Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
- URL: http://arxiv.org/abs/2506.21182v1
- Date: Thu, 26 Jun 2025 12:40:48 GMT
- Title: Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
- Authors: Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
- Abstract summary: The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb
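As a concrete illustration of the kind of dataset-integrity check such a CI pipeline can run, here is a minimal Python sketch that verifies dataset files against a checksum manifest. The manifest format, paths, and function names are illustrative assumptions, not MTEB's actual implementation.

```python
# Minimal sketch of a CI dataset-integrity check: verify each dataset file
# against a recorded SHA-256 checksum. The manifest layout is an assumption,
# not MTEB's actual scheme.
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large datasets fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_datasets(manifest_path: Path) -> list[str]:
    """Return a list of failures; an empty list means the check passed."""
    manifest = json.loads(manifest_path.read_text())  # {"relative/path": "sha256 hex"}
    failures = []
    for rel_path, expected in manifest.items():
        path = manifest_path.parent / rel_path
        if not path.exists():
            failures.append(f"missing: {rel_path}")
        elif sha256_of(path) != expected:
            failures.append(f"checksum mismatch: {rel_path}")
    return failures


if __name__ == "__main__":
    problems = validate_datasets(Path("datasets/manifest.json"))
    if problems:
        raise SystemExit("\n".join(problems))  # non-zero exit fails the CI job
    print("all datasets verified")
```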
Related papers
- FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation [17.64876163735292]
FrontendBench is a benchmark co-developed by humans and Large Language Models (LLMs). The benchmark comprises 148 meticulously crafted prompt-test case pairs spanning five levels of web components. An automatic evaluation framework executes generated code within a sandbox environment and assesses outcomes using predefined test scripts (a minimal version of this execute-and-test loop is sketched below).
arXiv Detail & Related papers (2025-06-16T03:20:31Z)
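The execute-and-test loop described in the FrontendBench summary might look like the following sketch. The single-file interface, the Python stand-ins for what are really web components, and the timeout value are assumptions for illustration, not the paper's actual harness.

```python
# Rough sketch of sandboxed evaluation: write the generated code and its
# predefined test script into a temporary directory, run the tests in a
# subprocess with a timeout, and treat exit code 0 as a pass.
# NOTE: a temp dir plus timeout is not real isolation; a production harness
# would use containers or similar.
import subprocess
import tempfile
from pathlib import Path


def run_case(generated_code: str, test_script: str, timeout_s: int = 30) -> bool:
    """Return True if the generated code passes its predefined test script."""
    with tempfile.TemporaryDirectory() as sandbox:
        sandbox_dir = Path(sandbox)
        (sandbox_dir / "candidate.py").write_text(generated_code)
        (sandbox_dir / "test_candidate.py").write_text(test_script)
        try:
            result = subprocess.run(
                ["python", "test_candidate.py"],
                cwd=sandbox_dir,
                capture_output=True,
                timeout=timeout_s,  # kill runaway or hanging generations
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0


if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b\n"
    test = "from candidate import add\nassert add(1, 2) == 3\n"
    print(run_case(code, test))  # True if the assertion holds
```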
- LGAI-EMBEDDING-Preview Technical Report [41.68404082385825]
This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings (hard-negative mining is sketched below). Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score.
arXiv Detail & Related papers (2025-06-09T05:30:35Z)
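Hard-negative mining, one ingredient named in the LGAI summary, can be illustrated with a toy sketch: for each query, take the most similar passages that are not labeled positive as training negatives. The scoring model, the cutoff k, and the data here are assumptions, not the report's recipe.

```python
# Toy sketch of hard-negative mining: the highest-scoring non-positive
# passages are the "hard" negatives used for contrastive training.
import numpy as np


def mine_hard_negatives(
    query_emb: np.ndarray,     # shape (d,), L2-normalized
    passage_embs: np.ndarray,  # shape (n, d), rows L2-normalized
    positive_ids: set[int],
    k: int = 4,
) -> list[int]:
    """Return indices of the k most query-similar passages that are not positives."""
    scores = passage_embs @ query_emb  # cosine similarity for normalized rows
    ranked = np.argsort(-scores)       # most similar first
    return [int(i) for i in ranked if int(i) not in positive_ids][:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    passages = rng.normal(size=(100, 8))
    passages /= np.linalg.norm(passages, axis=1, keepdims=True)
    query = passages[3] + 0.1 * rng.normal(size=8)
    query /= np.linalg.norm(query)
    print(mine_hard_negatives(query, passages, positive_ids={3}))
```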
- WritingBench: A Comprehensive Benchmark for Generative Writing [87.48445972563631]
We present WritingBench, a benchmark designed to evaluate large language models (LLMs) across 6 core writing domains and 100 subdomains, encompassing creative, persuasive, informative, and technical writing. We propose a query-dependent evaluation framework that empowers LLMs to dynamically generate instance-specific assessment criteria (sketched below). This framework is complemented by a fine-tuned critic model for criteria-aware scoring, enabling evaluations in style, format, and length.
arXiv Detail & Related papers (2025-03-07T08:56:20Z)
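A query-dependent evaluation loop of the kind the WritingBench summary describes might be sketched as follows; the `llm` callable is a placeholder for any text-generation function, and the prompt wording is invented for illustration.

```python
# Sketch of query-dependent evaluation: derive per-instance criteria from the
# writing prompt itself, then score the response on each criterion.
from typing import Callable


def evaluate_writing(
    llm: Callable[[str], str],  # placeholder for any chat/completion function
    writing_prompt: str,
    response: str,
    n_criteria: int = 5,
) -> dict[str, str]:
    """Return {criterion: score} with criteria generated per instance."""
    criteria_text = llm(
        f"List {n_criteria} assessment criteria, one per line, for judging a "
        f"response to this writing task:\n{writing_prompt}"
    )
    criteria = [ln.strip("- ").strip() for ln in criteria_text.splitlines() if ln.strip()]
    return {
        c: llm(
            f"Task: {writing_prompt}\nResponse: {response}\n"
            f"Rate the response 1-10 on: {c}. Reply with the number only."
        )
        for c in criteria[:n_criteria]
    }


if __name__ == "__main__":
    # Dummy LLM so the sketch runs standalone.
    dummy = lambda p: "clarity\nstructure\ntone" if "assessment criteria" in p else "7"
    print(evaluate_writing(dummy, "Write a product launch email.", "Hi all, ...", 3))
```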
- Movie2Story: A framework for understanding videos and telling stories in the form of novel text [0.0]
We propose a novel benchmark to evaluate text generation capabilities in scenarios enriched with auxiliary information. Our work introduces an innovative automatic dataset generation method to ensure the availability of accurate auxiliary information. Our experiments reveal that current Multi-modal Large Language Models (MLLMs) perform suboptimally under the proposed evaluation metrics.
arXiv Detail & Related papers (2024-12-19T15:44:04Z)
- BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices [28.70453947993952]
We develop an assessment framework covering 46 best practices across an AI benchmark's lifecycle and evaluate 24 AI benchmarks against it.
We find large quality differences and that commonly used benchmarks suffer from significant issues.
arXiv Detail & Related papers (2024-11-20T02:38:24Z)
- Benchmarks as Microscopes: A Call for Model Metrology [76.64402390208576]
Modern language models (LMs) pose a new challenge in capability assessment.
To be confident in our metrics, we need a new discipline of model metrology.
arXiv Detail & Related papers (2024-07-22T17:52:12Z)
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies, sketched below: 1) a mimicking strategy that generates similar samples based on the original data, and 2) an extending strategy that further expands existing samples.
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
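The two strategies named in the summary could be prompted roughly as follows; the `llm` callable is a placeholder and the prompt wording is an assumption, not the paper's templates.

```python
# Sketch of the two dataset-updating strategies: "mimicking" generates a
# fresh sample in the same style, "extending" makes an existing sample harder.
from typing import Callable


def mimic_sample(llm: Callable[[str], str], original: str) -> str:
    """Mimicking: a new sample with the same form and difficulty."""
    return llm(
        "Write a new evaluation question in the same style and difficulty as "
        f"the following, but about different facts:\n{original}"
    )


def extend_sample(llm: Callable[[str], str], original: str) -> str:
    """Extending: build on the existing sample so it is strictly harder."""
    return llm(
        "Extend the following evaluation question with one additional "
        f"reasoning step so it becomes harder:\n{original}"
    )
```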
- How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark [60.72725673114168]
We revisit the question of accurate BERT pruning during fine-tuning on downstream datasets.
We propose a set of general guidelines for successful pruning, even on the challenging SMC benchmark (a toy magnitude-pruning sketch follows below).
arXiv Detail & Related papers (2023-12-21T03:11:30Z)
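As a rough illustration of the technique behind the pruning paper above, here is a toy magnitude-pruning step in NumPy: zero the smallest-magnitude weights and keep a mask so fine-tuning cannot revive them. This is a generic sketch of the idea, not the authors' method; their sparsity schedules and guidelines are omitted.

```python
# Toy magnitude pruning: zero the smallest-magnitude entries of a weight
# matrix and return a mask to keep them at zero during later fine-tuning.
import numpy as np


def prune_by_magnitude(weights: np.ndarray, sparsity: float) -> tuple[np.ndarray, np.ndarray]:
    """Return (pruned_weights, keep_mask) with ~`sparsity` fraction zeroed."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold  # ties at the threshold are also pruned
    return weights * mask, mask


if __name__ == "__main__":
    w = np.random.default_rng(0).normal(size=(4, 4))
    pruned, mask = prune_by_magnitude(w, sparsity=0.5)
    print(f"kept {mask.mean():.0%} of weights")  # ~50% here
```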
- Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
- EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: an instruction-based benchmark and evaluation suite for the automatic evaluation of editing capabilities.
We evaluate several pre-trained models, finding that InstructGPT and PEER perform best but that most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well (a minimal metric-agreement check is sketched below), and that optimizing prompts for the highest performance does not necessarily yield the strongest robustness across models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
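The metric-agreement point in the EditEval summary can be checked in a few lines: score a set of systems under two metrics and compute the rank correlation of the resulting rankings. The scores below are made-up placeholders, not EditEval numbers, and the snippet assumes SciPy is installed.

```python
# Minimal metric-agreement check: low Spearman correlation between two
# metrics' system rankings means they disagree about which system is best.
from scipy.stats import spearmanr

metric_a_scores = [38.2, 41.5, 35.9, 44.1, 40.3]  # hypothetical per-system scores
metric_b_scores = [22.4, 19.8, 25.1, 21.0, 23.7]  # hypothetical per-system scores

rho, p_value = spearmanr(metric_a_scores, metric_b_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```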