CLEVA: Chinese Language Models EVAluation Platform
- URL: http://arxiv.org/abs/2308.04813v2
- Date: Mon, 16 Oct 2023 11:32:14 GMT
- Title: CLEVA: Chinese Language Models EVAluation Platform
- Authors: Yanyang Li, Jianqiao Zhao, Duo Zheng, Zi-Yuan Hu, Zhi Chen, Xiaohui
Su, Yongfeng Huang, Shijia Huang, Dahua Lin, Michael R. Lyu, Liwei Wang
- Abstract summary: We present CLEVA, a user-friendly platform crafted to holistically evaluate Chinese LLMs.
Our platform employs a standardized workflow to assess LLMs' performance across various dimensions, regularly updating a competitive leaderboard.
To alleviate contamination, CLEVA curates a significant proportion of new data and develops a sampling strategy that guarantees a unique subset for each leaderboard round.
Empowered by an easy-to-use interface that requires just a few mouse clicks and a model API, users can conduct a thorough evaluation with minimal coding.
- Score: 92.42981537317817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the continuous emergence of Chinese Large Language Models (LLMs), how to
evaluate a model's capabilities has become an increasingly significant issue.
The absence of a comprehensive Chinese benchmark that thoroughly assesses a
model's performance, the unstandardized and incomparable prompting procedure,
and the prevalent risk of contamination pose major challenges in the current
evaluation of Chinese LLMs. We present CLEVA, a user-friendly platform crafted
to holistically evaluate Chinese LLMs. Our platform employs a standardized
workflow to assess LLMs' performance across various dimensions, regularly
updating a competitive leaderboard. To alleviate contamination, CLEVA curates a
significant proportion of new data and develops a sampling strategy that
guarantees a unique subset for each leaderboard round. Empowered by an
easy-to-use interface that requires just a few mouse clicks and a model API,
users can conduct a thorough evaluation with minimal coding. Large-scale
experiments featuring 23 Chinese LLMs have validated CLEVA's efficacy.
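The abstract does not detail CLEVA's sampling strategy; as an illustration of one simple way to guarantee a disjoint evaluation subset per leaderboard round, a seeded shuffle-and-slice partition (all names hypothetical) can be sketched as:

```python
import random

def sample_round_subsets(pool, subset_size, num_rounds, seed=0):
    """Partition an evaluation pool into disjoint per-round subsets.

    Shuffling once with a fixed seed and slicing guarantees that no
    example appears in more than one leaderboard round, limiting the
    chance that a round's test data leaks into later training sets.
    """
    if subset_size * num_rounds > len(pool):
        raise ValueError("pool too small for the requested rounds")
    items = list(pool)
    random.Random(seed).shuffle(items)  # deterministic, reproducible order
    return [items[i * subset_size:(i + 1) * subset_size]
            for i in range(num_rounds)]

rounds = sample_round_subsets(range(100), subset_size=20, num_rounds=4)
# every pair of rounds is disjoint by construction
assert all(set(a).isdisjoint(b)
           for i, a in enumerate(rounds) for b in rounds[i + 1:])
```

This is only a sketch under the stated assumptions; the platform's actual strategy may weight or stratify the pool rather than partition it uniformly.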
Related papers
- LLaVA-Critic: Learning to Evaluate Multimodal Models [110.06665155812162]
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator.
LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios.
arXiv Detail & Related papers (2024-10-03T17:36:33Z)
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models [71.8065384742686]

LMMS-EVAL is a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models.
LMMS-EVAL LITE is a pruned evaluation toolkit that emphasizes both coverage and efficiency.
Multimodal LIVEBENCH utilizes continuously updating news and online forums to assess models' generalization abilities in the wild.
arXiv Detail & Related papers (2024-07-17T17:51:53Z)
- Dynamic data sampler for cross-language transfer learning in large language models [34.464472766868106]
ChatFlow is a cross-language transfer-based Large Language Model (LLM).
We employ a mix of Chinese, English, and parallel corpus to continuously train the LLaMA2 model.
Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance.
arXiv Detail & Related papers (2024-05-17T08:40:51Z)
- Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge.
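The summary does not show how per-sample criteria reach the judge model; a minimal sketch of assembling such a judging prompt, with an entirely hypothetical function name and prompt layout, might look like:

```python
def build_judge_prompt(question, response, criteria):
    """Assemble a judging prompt for a strong MLLM judge.

    criteria: evaluation criteria written for this specific sample,
    rather than one global rubric shared across the benchmark.
    """
    rubric = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(criteria))
    return (
        "You are grading a model's answer.\n"
        f"Question: {question}\n"
        f"Answer: {response}\n"
        "Score the answer from 1 to 10 against each criterion:\n"
        f"{rubric}"
    )

prompt = build_judge_prompt(
    "What animal is shown in the image?",
    "A cat sitting on a windowsill.",
    ["Identifies the animal correctly", "Is concise"],
)
```

The benchmark's actual prompt template and scoring scale are not specified in the summary; this only illustrates the per-sample-rubric idea.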
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- LM-Polygraph: Uncertainty Estimation for Language Models [71.21409522341482]
Uncertainty estimation (UE) methods are one path to safer, more responsible, and more effective use of large language models (LLMs).
We introduce LM-Polygraph, a framework with implementations of a battery of state-of-the-art UE methods for LLMs in text generation tasks, with unified program interfaces in Python.
It introduces an extendable benchmark for consistent evaluation of UE techniques by researchers, and a demo web application that enriches the standard chat dialog with confidence scores.
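LM-Polygraph's actual interfaces are not shown here; as an illustration of the simplest kind of UE baseline such frameworks implement, mean token entropy over a generated sequence (function name hypothetical) can be sketched as:

```python
import math

def mean_token_entropy(step_probs):
    """Average Shannon entropy of per-step token distributions.

    step_probs: one probability distribution per generated token.
    Higher values indicate a less confident generation.
    """
    entropies = [-sum(p * math.log(p) for p in dist if p > 0)
                 for dist in step_probs]
    return sum(entropies) / len(entropies)

# A peaked distribution signals low uncertainty, a uniform one high.
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
assert mean_token_entropy(confident) < mean_token_entropy(uncertain)
```

Such a score is what a demo like the one described could attach to each chat response as a confidence signal; the framework itself bundles many more sophisticated UE methods.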
arXiv Detail & Related papers (2023-11-13T15:08:59Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.