MTEB: Massive Text Embedding Benchmark
- URL: http://arxiv.org/abs/2210.07316v3
- Date: Sun, 19 Mar 2023 13:37:01 GMT
- Title: MTEB: Massive Text Embedding Benchmark
- Authors: Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers
- Abstract summary: It is unclear whether state-of-the-art embeddings on semantic textual similarity can be equally well applied to other tasks like clustering or reranking.
Massive Text Embedding Benchmark (MTEB) spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
- Score: 6.023518635799927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text embeddings are commonly evaluated on a small set of datasets from a
single task not covering their possible applications to other tasks. It is
unclear whether state-of-the-art embeddings on semantic textual similarity
(STS) can be equally well applied to other tasks like clustering or reranking.
This makes progress in the field difficult to track, as various models are
constantly being proposed without proper evaluation. To solve this problem, we
introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding
tasks covering a total of 58 datasets and 112 languages. Through the
benchmarking of 33 models on MTEB, we establish the most comprehensive
benchmark of text embeddings to date. We find that no particular text embedding
method dominates across all tasks. This suggests that the field has yet to
converge on a universal text embedding method and scale it up sufficiently to
provide state-of-the-art results on all embedding tasks. MTEB comes with
open-source code and a public leaderboard at
https://github.com/embeddings-benchmark/mteb.
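The released package can reproduce this benchmarking in a few lines. The sketch below follows the repository's documented usage; the model checkpoint and the single task selected are illustrative choices, not the paper's full 33-model setup.
```python
# Minimal sketch of an MTEB evaluation, following the repository's
# documented usage. The checkpoint and task are illustrative; omitting
# `tasks` runs the full suite of 8 task types / 58 datasets.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any model with an `encode` method
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")
print(results)
```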
Related papers
- BERT or FastText? A Comparative Analysis of Contextual as well as Non-Contextual Embeddings [0.4194295877935868]
The choice of embeddings plays a critical role in enhancing the performance of NLP tasks.
In this study, we investigate the impact of various embedding techniques (contextual BERT-based, non-contextual BERT-based, and FastText-based) on NLP classification tasks specific to the Marathi language; a minimal comparison sketch follows this entry.
arXiv Detail & Related papers (2024-11-26T18:25:57Z)
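A minimal sketch of the contrast being studied, assuming the `transformers` and `fasttext` packages; the multilingual checkpoints below are illustrative stand-ins for the paper's Marathi-specific models.
```python
# Contextual vs. non-contextual sentence embeddings, as a hedged sketch.
# Checkpoints are illustrative, not the paper's Marathi-specific models.
import numpy as np
import torch
import fasttext
import fasttext.util
from transformers import AutoModel, AutoTokenizer

sentence = "text embeddings are useful"

# Contextual: mean-pool BERT's final hidden states over the tokens.
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
with torch.no_grad():
    hidden = bert(**tok(sentence, return_tensors="pt")).last_hidden_state
contextual = hidden.mean(dim=1).squeeze(0).numpy()

# Non-contextual: average static FastText word vectors (downloads vectors once).
fasttext.util.download_model("en", if_exists="ignore")
ft = fasttext.load_model("cc.en.300.bin")
non_contextual = np.mean([ft.get_word_vector(w) for w in sentence.split()], axis=0)

print(contextual.shape, non_contextual.shape)  # (768,) (300,)
```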
- The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding [8.097049661773465]
Scandinavian Embedding Benchmark (SEB) is a framework that enables text embedding evaluation for Scandinavian languages.
Building on SEB, we evaluate more than 26 models, uncovering significant performance disparities between public and commercial solutions.
We open-source SEB and integrate it with MTEB, thus bridging the text embedding evaluation gap for Scandinavian languages.
arXiv Detail & Related papers (2024-06-04T15:11:27Z)
- The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection.
The alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z)
- Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction [36.915250638481986]
We introduce LiveSum, a new benchmark dataset for generating summary tables of competitions based on real-time commentary texts.
We evaluate the performances of state-of-the-art Large Language Models on this task in both fine-tuning and zero-shot settings.
We additionally propose a novel pipeline called $T^3$ (Text-Tuple-Table) to improve their performance; a minimal sketch of the text-to-tuple-to-table flow follows this entry.
arXiv Detail & Related papers (2024-04-22T14:31:28Z)
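A hedged sketch of the text → tuple → table flow; `extract_tuples` is a hypothetical stand-in for the paper's LLM-based extraction step, and the tuples shown are invented for illustration.
```python
# Text -> tuple -> table, as a hedged sketch of a $T^3$-style flow.
# `extract_tuples` is hypothetical; a real system would prompt an LLM
# to emit (entity, attribute, value) tuples from the commentary text.
from collections import defaultdict

def extract_tuples(commentary: str) -> list[tuple[str, str, int]]:
    # Hypothetical output for illustration only.
    return [("Team A", "goals", 1), ("Team B", "goals", 2), ("Team A", "shots", 5)]

def tuples_to_table(tuples: list[tuple[str, str, int]]) -> dict:
    # Global integration: one row per entity, one column per attribute.
    table: dict[str, dict[str, int]] = defaultdict(dict)
    for entity, attribute, value in tuples:
        table[entity][attribute] = value
    return dict(table)

print(tuples_to_table(extract_tuples("...")))
# {'Team A': {'goals': 1, 'shots': 5}, 'Team B': {'goals': 2}}
```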
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- RETSim: Resilient and Efficient Text Similarity [1.6228944467258688]
RETSim is a lightweight, multilingual deep learning model trained to produce robust metric embeddings for text retrieval, clustering, and dataset deduplication tasks.
We demonstrate that RETSim is significantly more robust and accurate than MinHash and neural text embeddings; a sketch of that MinHash baseline follows this entry.
We also introduce the W4NT3D benchmark for evaluating multilingual, near-duplicate text retrieval capabilities under adversarial settings.
arXiv Detail & Related papers (2023-11-28T22:54:33Z)
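For context, a sketch of the MinHash baseline that RETSim is compared against, using the `datasketch` package; the shingle size and LSH threshold are illustrative.
```python
# Near-duplicate retrieval with MinHash + LSH (the baseline RETSim is
# compared against), via the `datasketch` package. Shingle size and
# threshold are illustrative.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 3] for i in range(len(text) - 2)}:  # char 3-grams
        m.update(shingle.encode("utf8"))
    return m

docs = {"a": "the quick brown fox", "b": "the quick brown foxx", "c": "lorem ipsum"}
lsh = MinHashLSH(threshold=0.8, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, minhash_of(text))

print(lsh.query(minhash_of(docs["a"])))  # near-duplicates of "a", e.g. ['a', 'b']
```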
- MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos [106.06278332186106]
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction.
Numerous limitations exist in public MSMO datasets.
We have meticulously curated the MMSum dataset to address them.
arXiv Detail & Related papers (2023-06-07T07:43:11Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of the classification, segmentation, and recognition branches, resulting in deeper feature sharing; an illustrative multi-branch skeleton follows this entry.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
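A hedged PyTorch skeleton of the multi-branch idea: a shared encoder and query-based decoder feed classification, segmentation, and recognition heads. Module choices and sizes are illustrative assumptions, not the paper's architecture.
```python
# Illustrative multi-branch text-spotting skeleton: shared features feed
# classification, segmentation, and recognition heads. All sizes and
# module choices are assumptions made for this sketch.
import torch
import torch.nn as nn

class MultiBranchSpotter(nn.Module):
    def __init__(self, d_model: int = 256, vocab: int = 97, n_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, d_model, 3, stride=2, padding=1), nn.ReLU())
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.cls_head = nn.Linear(d_model, n_classes)  # text / no-text per query
        self.seg_head = nn.Conv2d(d_model, 1, 1)       # pixel-level text mask
        self.rec_head = nn.Linear(d_model, vocab)      # character logits per query

    def forward(self, images: torch.Tensor, queries: torch.Tensor):
        feats = self.encoder(images)                   # (B, C, H/2, W/2)
        memory = feats.flatten(2).transpose(1, 2)      # (B, HW, C)
        hs = self.decoder(queries, memory)             # (B, Q, C): shared features
        return self.cls_head(hs), self.seg_head(feats), self.rec_head(hs)

cls, seg, rec = MultiBranchSpotter()(torch.randn(2, 3, 64, 64), torch.randn(2, 10, 256))
```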
- One Embedder, Any Task: Instruction-Finetuned Text Embeddings [105.82772523968961]
INSTRUCTOR is a new method for computing text embeddings given task instructions.
Every text input is embedded together with instructions explaining the use case.
We evaluate INSTRUCTOR on 70 embedding evaluation tasks; a usage sketch follows this entry.
arXiv Detail & Related papers (2022-12-19T18:57:05Z)
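A sketch of the instruction-conditioned interface, following the usage documented for the released model; the instruction wording and checkpoint name are illustrative.
```python
# Instruction-conditioned embeddings, following the documented usage of
# the released INSTRUCTOR model. Instruction wording is illustrative.
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")

# Each input is an [instruction, text] pair; the instruction states the use
# case, so the same text can be embedded differently per downstream task.
pairs = [
    ["Represent the scientific title for retrieval:", "Massive Text Embedding Benchmark"],
    ["Represent the question for retrieving supporting documents:", "What is MTEB?"],
]
embeddings = model.encode(pairs)
print(embeddings.shape)  # (2, 768)
```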
- Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.