FaMTEB: Massive Text Embedding Benchmark in Persian Language
- URL: http://arxiv.org/abs/2502.11571v1
- Date: Mon, 17 Feb 2025 09:05:21 GMT
- Title: FaMTEB: Massive Text Embedding Benchmark in Persian Language
- Authors: Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, Reza Kazemi, Arash Amini,
- Abstract summary: This paper introduces a comprehensive benchmark for Persian (Farsi) text embeddings built upon the Massive Text Embedding Benchmark (MTEB).
Our benchmark includes 63 datasets spanning seven different tasks.
We evaluate the performance of several Persian and multilingual embedding models in a range of tasks.
- Abstract: In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are formed as a combination of existing, translated, and newly generated data, offering a diverse evaluation framework for Persian language models. Given the increasing use of text embedding models in chatbots, evaluation datasets are becoming inseparable ingredients in chatbot challenges and Retrieval-Augmented Generation systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. In addition, in this paper, we introduce the new task of summary retrieval which is not part of the tasks included in standard MTEB. Another contribution of this paper is the introduction of a substantial number of new Persian language NLP datasets suitable for training and evaluation, some of which have no previous counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models in a range of tasks. This work introduces an open-source benchmark with datasets, code and a public leaderboard.
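Because the benchmark builds on MTEB, evaluation presumably runs through the standard mteb Python package. The sketch below shows that flow under this assumption; the model name and the language filter are illustrative, and the actual FaMTEB task names may differ in the released code.

```python
# A minimal sketch of evaluating an embedding model on Persian tasks via the
# mteb package. Assumes FaMTEB's tasks are registered in mteb; the model and
# language filter below are illustrative choices, not taken from the paper.
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-small")  # any embedding model

tasks = mteb.get_tasks(languages=["fas"])   # select Persian (ISO 639-3: fas) tasks
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/famteb")
print(results)  # per-task scores; JSON copies are written to the output folder
```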
Related papers
- MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB)
MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages.
We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
- Matina: A Large-Scale 73B Token Persian Text Corpus [1.396406461086233]
Existing Persian datasets are typically small and lack content diversity, consisting mainly of weblogs and news articles.
The Matina corpus is a new Persian dataset of 72.9B tokens, carefully preprocessed and deduplicated to ensure high data quality.
arXiv Detail & Related papers (2025-02-13T11:22:19Z)
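The Matina summary above mentions deduplication. As a generic illustration only (the snippet does not describe Matina's actual pipeline), exact-duplicate removal by hashing normalized text can look like this:

```python
# A minimal sketch of exact-duplicate removal by hashing normalized text;
# this is a generic corpus-deduplication illustration, not Matina's method.
import hashlib
import re

def normalize(text: str) -> str:
    # collapse whitespace and lowercase so trivial variants hash identically
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["سلام دنیا", "سلام  دنیا ", "خبر تازه"]  # second item is a trivial variant
print(len(deduplicate(corpus)))  # -> 2
```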
- TextClass Benchmark: A Continuous Elo Rating of LLMs in Social Sciences [0.0]
The TextClass Benchmark project aims to provide a comprehensive, fair, and dynamic evaluation of LLMs and transformers for text classification tasks.
The evaluation spans various domains and languages across the social science disciplines that use NLP and text-as-data approaches.
The leaderboards present performance metrics and relative rankings using a tailored Elo rating system.
arXiv Detail & Related papers (2024-11-30T17:09:49Z)
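For readers unfamiliar with Elo scoring, a minimal sketch of the standard pairwise update follows; the K-factor and initial ratings are illustrative defaults, not details from the TextClass paper.

```python
# The standard Elo update for one pairwise comparison; K = 32 and the
# starting rating of 1000 are illustrative defaults, not taken from the paper.
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings (r_a', r_b') after one comparison.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: model A beats model B, both starting at 1000.
print(elo_update(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```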
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
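The summary above leaves the mechanics implicit. One common embedding-space instantiation of distributional precision and recall uses k-NN support estimates, sketched below; treat it as an illustration of the general idea, not necessarily the paper's exact formulation.

```python
# A minimal sketch of k-NN-based distributional precision and recall over
# embedding sets; a generic illustration, not the paper's exact method.
import numpy as np

def knn_radii(X: np.ndarray, k: int) -> np.ndarray:
    # distance from each point to its k-th nearest neighbor within X
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)  # column 0 is the zero self-distance
    return d[:, k]

def coverage(A: np.ndarray, B: np.ndarray, k: int = 3) -> float:
    # fraction of points in A falling inside some k-NN ball around B
    radii = knn_radii(B, k)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return float(np.mean((d <= radii[None, :]).any(axis=1)))

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))  # embeddings of reference texts
gen = rng.normal(size=(200, 8))   # embeddings of generated texts
precision = coverage(gen, real)   # generated text that looks real
recall = coverage(real, gen)      # real modes the generator reaches
print(f"precision={precision:.2f} recall={recall:.2f}")
```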
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Text classification dataset and analysis for Uzbek language [0.0]
We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites.
We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures.
Our experiments show that Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models.
arXiv Detail & Related papers (2023-02-28T11:21:24Z)
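As a point of reference for the traditional bag-of-words baselines mentioned in the entry above, a minimal TF-IDF plus logistic-regression pipeline in scikit-learn is sketched below; the toy documents are placeholders, not the paper's Uzbek news corpus.

```python
# A minimal bag-of-words text-classification baseline; the training data
# below is a stand-in, not the Uzbek news dataset from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["sport yangiliklari ...", "siyosat xabarlari ..."]  # placeholder docs
train_labels = ["sport", "siyosat"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_texts, train_labels)
print(clf.predict(["yangi sport natijalari"]))  # predicted category
```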
- FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z)
- SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese [3.0938904602244355]
We introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, a collection of datasets and models across a diverse set of SMTC tasks.
We implement and analyze the effectiveness of a variety of multilingual BERT-based models and monolingual BERT-based models for tasks in the benchmark.
The benchmark provides an objective assessment of multilingual and monolingual BERT-based models, which will benefit future BERTology studies of Vietnamese.
arXiv Detail & Related papers (2022-09-21T16:33:46Z)
- SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech [44.68649535280397]
We propose a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE).
SLUE consists of limited-size labeled training sets and corresponding evaluation sets.
We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets.
We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.
arXiv Detail & Related papers (2021-11-19T18:59:23Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.