MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2502.17163v3
- Date: Tue, 29 Apr 2025 07:28:04 GMT
- Title: MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
- Authors: María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
- Abstract summary: We develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs.
- Score: 13.440594349043916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally we apply the dataset to our main use-case which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. Our dataset is available at https://github.com/amazon-science/MEMERAG
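The paper's main use-case is benchmarking multilingual automatic evaluators (LLM-as-a-judge) against expert human annotations. As an illustration only, the sketch below shows one way such a meta-evaluation could be scored: comparing a judge's binary faithfulness verdicts with human labels via raw agreement and Cohen's kappa. The label encoding and variable names are assumptions for this example, not the released MEMERAG schema or evaluation protocol.

```python
# Illustrative sketch (not the authors' code): comparing an LLM-as-a-judge's
# binary faithfulness verdicts with expert human labels, the kind of
# meta-evaluation MEMERAG is built to support. The label encoding
# (1 = faithful, 0 = unfaithful) is an assumption for this example.
from collections import Counter

def agreement(human_labels, judge_labels):
    """Return raw agreement and Cohen's kappa between two label sequences."""
    assert len(human_labels) == len(judge_labels) and human_labels
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Chance agreement from the marginal label distributions of each rater.
    human_counts, judge_counts = Counter(human_labels), Counter(judge_labels)
    labels = set(human_labels) | set(judge_labels)
    expected = sum((human_counts[c] / n) * (judge_counts[c] / n) for c in labels)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Example: one language slice with hypothetical labels.
human = [1, 1, 0, 1, 0, 0, 1, 1]   # expert faithfulness annotations
judge = [1, 1, 0, 0, 0, 1, 1, 1]   # LLM-as-a-judge verdicts
acc, kappa = agreement(human, judge)
print(f"agreement={acc:.2f}, kappa={kappa:.2f}")
```

Running a score like this per language and per judge configuration is one simple way to check whether a prompting technique or a stronger judge model actually moves agreement with the human annotators.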
Related papers
- Large Language Models as User-Agents for Evaluating Task-Oriented-Dialogue Systems [6.8738526619759535]
Offline datasets have been used to evaluate task-oriented dialogue (TOD) models.
User-agents, which are context-aware, can simulate the variability and unpredictability of human conversations.
arXiv Detail & Related papers (2024-11-15T06:05:45Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the large pool of existing ones, addressing previous work's oversight of benchmark utility.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems [43.19298196163617]
We develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia.
arXiv Detail & Related papers (2024-10-17T16:18:49Z)
- Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs [29.72874725703848]
We introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format; and Assessment+, which enhances the interaction process.
We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning.
TestAgent enables automatic dynamic benchmark generation and in-depth assessment across diverse vertical domain scenarios.
arXiv Detail & Related papers (2024-10-15T11:20:42Z)
- Do Text-to-Vis Benchmarks Test Real Use of Visualisations? [11.442971909006657]
This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.
Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.
One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark.
This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
arXiv Detail & Related papers (2024-07-29T06:13:28Z)
- Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark [12.729687989535359]
Evaluating Large Language Models (LLMs) in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts.
We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy.
arXiv Detail & Related papers (2024-06-25T13:20:08Z)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the quality of the generated content through the evaluator's accuracy in answering the proxy-questions (see the sketch below).
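As a rough illustration of that scoring idea (an assumption about the mechanics, not ProxyQA's released implementation), the snippet below scores a long-form answer by the fraction of proxy-questions an evaluator answers correctly when given only the generated text; `answer_fn` is a placeholder for whatever QA model or LLM call serves as the evaluator.

```python
# Rough sketch of proxy-question scoring (an assumption, not ProxyQA's code).
# `answer_fn(question, context)` is a placeholder for the evaluator model call.
def proxy_score(generated_text, proxy_questions, answer_fn):
    """Fraction of proxy-questions answered correctly from the generated text."""
    correct = 0
    for item in proxy_questions:
        predicted = answer_fn(item["question"], generated_text)
        correct += int(predicted.strip().lower() == item["answer"].strip().lower())
    return correct / len(proxy_questions)

# Usage with a trivial stand-in evaluator:
questions = [{"question": "What year was X founded?", "answer": "1998"}]
print(proxy_score("X was founded in 1998 ...", questions,
                  lambda q, ctx: "1998" if "1998" in ctx else "unknown"))
```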
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
GEMv2, the new version of the Generation, Evaluation, and Metrics Benchmark, introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.