OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large
Language Models
- URL: http://arxiv.org/abs/2402.13524v1
- Date: Wed, 21 Feb 2024 04:42:41 GMT
- Title: OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large
Language Models
- Authors: Yang Liu, Meng Xu, Shuo Wang, Liner Yang, Haoyu Wang, Zhenghao Liu,
Cunliang Kong, Yun Chen, Yang Liu, Maosong Sun, Erhong Yang
- Abstract summary: We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar)
- Score: 59.54423478596468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models (LLMs) should generally benefit individuals from
various cultural backgrounds around the world. However, most recent advanced
generative evaluation benchmarks tailed for LLMs mainly focus on English. To
this end, we introduce OMGEval, the first Open-source Multilingual Generative
test set that can assess the capability of LLMs in different languages. For
each language, OMGEval provides 804 open-ended questions, covering a wide range
of important capabilities of LLMs, such as general knowledge, logical
reasoning, and so on. Each question is rigorously verified by human annotators.
Notably, to sufficiently reflect the compatibility of LLMs in different
cultural backgrounds, we perform localization for each non-English language.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh,
Ru, Fr, Es, Ar). Following AlpacaEval, we employ GPT-4 as the adjudicator to
automatically score different model outputs, which is shown closely related to
human evaluation. We evaluate several representative multilingual LLMs on the
proposed OMGEval, which we believe will provide a valuable reference for the
community to further understand and improve the multilingual capability of
LLMs. OMGEval is available at https://github.com/blcuicall/OMGEval.
Related papers
- All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.93600813999306]
ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.
It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages.
The benchmark offers a robust and nuanced evaluation framework featuring various question formats, including true/false, multiple choice, and open-ended questions.
arXiv Detail & Related papers (2024-11-25T15:44:42Z) - MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models [3.961168847961322]
Large language models (LLMs) are commonly used as evaluators in tasks, where they act as proxies for human preferences or judgments.
Existing benchmarks primarily focus on English, offering limited insight into LLMs' effectiveness as evaluators in non-English contexts.
We introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories.
arXiv Detail & Related papers (2024-10-23T06:04:55Z) - Adapting Multilingual LLMs to Low-Resource Languages with Knowledge Graphs via Adapters [3.7273829129985305]
This paper explores integration of graph knowledge from linguistic into multilingual Large Models (LLMs)
We employ language-specific adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER)
We assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER.
arXiv Detail & Related papers (2024-07-01T15:56:24Z) - LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback [61.23008372927665]
We introduce xLLMs-100, which scales the multilingual capabilities of LLaMA and BLOOM to 100 languages.
We evaluate the multilingual understanding and generating capabilities of xLLMs-100 on five multilingual benchmarks.
arXiv Detail & Related papers (2024-06-03T20:25:12Z) - MindMerger: Efficient Boosting LLM Reasoning in non-English Languages [26.334092384176518]
Reasoning capabilities are crucial for Large Language Models (LLMs)
We propose MindMerger, which merges LLMs with the external language understanding capabilities from multilingual models.
MindMerger consistently outperforms all baselines, especially in low-resource languages.
arXiv Detail & Related papers (2024-05-27T17:41:54Z) - Zero-Shot Cross-Lingual Reranking with Large Language Models for
Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba)
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z) - How Vocabulary Sharing Facilitates Multilingualism in LLaMA? [19.136382859468693]
Large Language Models (LLMs) often show strong performance on English tasks, while exhibiting limitations on other languages.
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective.
arXiv Detail & Related papers (2023-11-15T16:13:14Z) - Establishing Vocabulary Tests as a Benchmark for Evaluating Large
Language Models [2.7013338932521416]
We advocate for the revival of vocabulary tests as a valuable tool for assessing Large Language Models (LLMs) performance.
We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge.
arXiv Detail & Related papers (2023-10-23T08:45:12Z) - Okapi: Instruction-tuned Large Language Models in Multiple Languages
with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.