MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations
- URL: http://arxiv.org/abs/2410.08319v1
- Date: Thu, 10 Oct 2024 19:14:54 GMT
- Title: MELO: An Evaluation Benchmark for Multilingual Entity Linking of Occupations
- Authors: Federico Retyk, Luis Gasco, Casimiro Pio Carrino, Daniel Deniz, Rabih Zbib,
- Abstract summary: We present a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations taxonomy.
MELO was built using high-quality, pre-existent human annotations.
- Score: 0.5528844566370006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the Multilingual Entity Linking of Occupations (MELO) Benchmark, a new collection of 48 datasets for evaluating the linking of entity mentions in 21 languages to the ESCO Occupations multilingual taxonomy. MELO was built using high-quality, pre-existent human annotations. We conduct experiments with simple lexical models and general-purpose sentence encoders, evaluated as bi-encoders in a zero-shot setup, to establish baselines for future research. The datasets and source code for standardized evaluation are publicly available at https://github.com/Avature/melo-benchmark
Related papers
- CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents [0.0]
We introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans.<n>We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset.<n>Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization.
arXiv Detail & Related papers (2026-03-04T09:35:47Z) - HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models [25.953042884928006]
We present an initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages.<n>At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data.<n>We train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models.
arXiv Detail & Related papers (2025-11-02T20:16:38Z) - MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation [13.440594349043916]
We develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG)
Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs)
We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs.
arXiv Detail & Related papers (2025-02-24T13:58:42Z) - MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems [43.19298196163617]
We develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia.
Using this idea, We develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia.
arXiv Detail & Related papers (2024-10-17T16:18:49Z) - SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization
Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z) - This is the way: designing and compiling LEPISZCZE, a comprehensive NLP
benchmark for Polish [5.8090623549313944]
We introduce LEPISZCZE, a new, comprehensive benchmark for Polish NLP.
We use five datasets from the Polish benchmark and add eight novel datasets.
We provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
arXiv Detail & Related papers (2022-11-23T16:51:09Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks on evaluation code generation models: MBXP and Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation [64.9546787488337]
We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation.
The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese.
arXiv Detail & Related papers (2022-10-01T05:02:04Z) - GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - X-FACT: A New Benchmark Dataset for Multilingual Fact Checking [21.2633064526968]
We introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims.
The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers.
arXiv Detail & Related papers (2021-06-17T05:09:54Z) - Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.