MERA: A Comprehensive LLM Evaluation in Russian
- URL: http://arxiv.org/abs/2401.04531v2
- Date: Fri, 12 Jan 2024 15:04:43 GMT
- Title: MERA: A Comprehensive LLM Evaluation in Russian
- Authors: Alena Fenogenova, Artem Chervyakov, Nikita Martynov, Anastasia
Kozlova, Maria Tikhonova, Albina Akhmetgareeva, Anton Emelyanov, Denis
Shevelev, Pavel Lebedev, Leonid Sinev, Ulyana Isaeva, Katerina Kolomeytseva,
Daniil Moskovskiy, Elizaveta Goncharova, Nikita Savushkin, Polina Mikhailova,
Denis Dimitrov, Alexander Panchenko, Sergei Markov
- Abstract summary: We introduce an open Multimodal Evaluation of Russian-language Architectures (MERA) benchmark for evaluating foundation models.
The benchmark encompasses 21 evaluation tasks for generative models in 11 skill domains.
We propose an evaluation methodology, an open-source code base for the MERA assessment, and a leaderboard with a submission system.
- Score: 43.65236119370611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the past few years, one of the most notable advancements in AI research
has been in foundation models (FMs), headlined by the rise of language models
(LMs). As the models' size increases, LMs demonstrate enhancements in
measurable aspects and the development of new qualitative features. However,
despite researchers' attention and the rapid growth in LM application, the
capabilities, limitations, and associated risks still need to be better
understood. To address these issues, we introduce an open Multimodal Evaluation
of Russian-language Architectures (MERA), a new instruction benchmark for
evaluating foundation models oriented towards the Russian language. The
benchmark encompasses 21 evaluation tasks for generative models in 11 skill
domains and is designed as a black-box test to ensure the exclusion of data
leakage. The paper introduces a methodology to evaluate FMs and LMs in zero-
and few-shot fixed instruction settings that can be extended to other
modalities. We propose an evaluation methodology, an open-source code base for
the MERA assessment, and a leaderboard with a submission system. We evaluate
open LMs as baselines and find that they are still far behind the human level.
We publicly release MERA to guide forthcoming research, anticipate
groundbreaking model features, standardize the evaluation procedure, and
address potential societal drawbacks.
Related papers
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]
LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z) - Belief Revision: The Adaptability of Large Language Models Reasoning [63.0281286287648]
We introduce Belief-R, a new dataset designed to test LMs' belief revision ability when presented with new evidence.
Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning framework.
We evaluate $sim$30 LMs across diverse prompting strategies and found that LMs generally struggle to appropriately revise their beliefs in response to new information.
arXiv Detail & Related papers (2024-06-28T09:09:36Z) - Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity
and Infant Care [14.326936563564171]
We present a benchmark, CARE-MI, for evaluating misinformation in large language models (LLMs)
Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models.
Using our benchmark, we conduct extensive experiments and found that current Chinese LLMs are far from perfect in the topic of maternity and infant care.
arXiv Detail & Related papers (2023-07-04T03:34:19Z) - Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
arXiv Detail & Related papers (2023-06-07T06:29:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.