Related papers: The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding

The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding

URL: http://arxiv.org/abs/2406.02396v1
Date: Tue, 4 Jun 2024 15:11:27 GMT
Title: The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding
Authors: Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, Kristoffer Laigaard Nielbo,
Abstract summary: Scandinavian Embedding Benchmark (SEB) is a framework that enables text embedding evaluation for Scandinavian languages. Building on SEB, we evaluate more than 26 models, uncovering significant performance disparities between public and commercial solutions. We open-source SEB and integrate it with MTEB, thus bridging the text embedding evaluation gap for Scandinavian languages.
Score: 8.097049661773465
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The evaluation of English text embeddings has transitioned from evaluating a handful of datasets to broad coverage across many tasks through benchmarks such as MTEB. However, this is not the case for multilingual text embeddings due to a lack of available benchmarks. To address this problem, we introduce the Scandinavian Embedding Benchmark (SEB). SEB is a comprehensive framework that enables text embedding evaluation for Scandinavian languages across 24 tasks, 10 subtasks, and 4 task categories. Building on SEB, we evaluate more than 26 models, uncovering significant performance disparities between public and commercial solutions not previously captured by MTEB. We open-source SEB and integrate it with MTEB, thus bridging the text embedding evaluation gap for Scandinavian languages.

Related papers

Evaluating Text Style Transfer: A Nine-Language Benchmark for Text Detoxification [66.69370876902222]
We perform the first comprehensive multilingual study on evaluation of text detoxification system across nine languages.<n>We assess the effectiveness of modern neural-based evaluation models alongside prompting-based LLM-as-a-judge approaches.<n>Our findings provide a practical recipe for designing more reliable multilingual TST evaluation pipeline.
arXiv Detail & Related papers (2025-07-21T12:38:07Z)
MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages. We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
FaMTEB: Massive Text Embedding Benchmark in Persian Language [9.204800002382042]
This paper introduces a comprehensive benchmark for Persian (Farsi) text embeddings built upon the Massive Text Embedding Benchmark (MTEB) Our benchmark includes 63 datasets spanning seven different tasks. We evaluate the performance of several Persian and multilingual embedding models in a range of tasks.
arXiv Detail & Related papers (2025-02-17T09:05:21Z)
A comparison of translation performance between DeepL and Supertext [3.858812369171884]
This study compares two commercial machine translation systems -- DeepL and Supertext. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions.
arXiv Detail & Related papers (2025-02-04T18:53:42Z)
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning [72.57452266982642]
We introduce OCRBench v2, a large-scale bilingual text-centric benchmark for text recognition. We find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations.
arXiv Detail & Related papers (2024-12-31T07:32:35Z)
BERT or FastText? A Comparative Analysis of Contextual as well as Non-Contextual Embeddings [0.4194295877935868]
The choice of embeddings plays a critical role in enhancing the performance of NLP tasks. In this study, we investigate the impact of various embedding techniques- Contextual BERT-based, Non-Contextual BERT-based, and FastText-based on NLP classification tasks specific to the Marathi language.
arXiv Detail & Related papers (2024-11-26T18:25:57Z)
ROAST: Review-level Opinion Aspect Sentiment Target Joint Detection for ABSA [50.90538760832107]
This research presents a novel task, Review-Level Opinion Aspect Sentiment Target (ROAST) ROAST seeks to close the gap between sentence-level and text-level ABSA by identifying every ABSA constituent at the review level. We extend the available datasets to enable ROAST, addressing the drawbacks noted in previous research.
arXiv Detail & Related papers (2024-05-30T17:29:15Z)
PL-MTEB: Polish Massive Text Embedding Benchmark [0.0]
Polish Massive Text Embedding Benchmark (PL-MTEB) is a benchmark for text embeddings in Polish. PL-MTEB consists of 28 diverse NLP tasks from 5 task types.
arXiv Detail & Related papers (2024-05-16T14:33:39Z)
SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
Discourse Centric Evaluation of Machine Translation with a Densely Annotated Parallel Corpus [82.07304301996562]
This paper presents a new dataset with rich discourse annotations, built upon the large-scale parallel corpus BWB introduced in Jiang et al. We investigate the similarities and differences between the discourse structures of source and target languages. We discover that MT outputs differ fundamentally from human translations in terms of their latent discourse structures.
arXiv Detail & Related papers (2023-05-18T17:36:41Z)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models. We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
arXiv Detail & Related papers (2023-04-28T12:11:21Z)
MTEB: Massive Text Embedding Benchmark [6.023518635799927]
It is unclear whether state-of-the-art embeddings on semantic textual similarity can be equally well applied to other tasks like clustering or reranking. Massive Text Embedding Benchmark (MTEB) spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
arXiv Detail & Related papers (2022-10-13T19:42:08Z)
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.