Related papers: SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

URL: http://arxiv.org/abs/2307.15020v1
Date: Thu, 27 Jul 2023 17:24:09 GMT
Title: SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark
Authors: Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, Zhenzhong Lan
Abstract summary: We propose a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE) Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones.
Score: 16.802854803128433
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have shown the potential to be integrated into human daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multi-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences achieved on open-ended ones. At the same time, they can complement each other to predict actual user preferences. We also demonstrate that GPT-4 is a reliable judge to automatically evaluate human preferences on open-ended questions in a Chinese context. Our benchmark will be released at https://www.CLUEbenchmarks.com

Related papers

Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language [2.594684920405059]
We present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria. We experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods.
arXiv Detail & Related papers (2025-03-31T05:04:25Z)
Humanity's Last Exam [434.8511341499966]
Humanity's Last Exam (HLE) is a multi-modal benchmark at the frontier of human knowledge. It consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval.
arXiv Detail & Related papers (2025-01-24T05:27:46Z)
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems [43.19298196163617]
We develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia. Using this idea, We develop MIRAGE-Bench, a standardized arena-based multilingual RAG benchmark for 18 diverse languages on Wikipedia.
arXiv Detail & Related papers (2024-10-17T16:18:49Z)
PersoBench: Benchmarking Personalized Response Generation in Large Language Models [6.8046587254152735]
We present a new benchmark, PersoBench, to evaluate the personalization ability of large language models (LLMs) in persona-aware dialogue generation. Our analysis, conducted on three well-known persona-aware datasets, evaluates multiple dimensions of response quality, including fluency, diversity, coherence, and personalization.
arXiv Detail & Related papers (2024-10-04T07:29:41Z)
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
evaluating machine-generated audio captions is a complex task that requires considering diverse factors. We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models. In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z)
Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset [7.954348293179786]
We propose CFLUE, a benchmark to assess the capability of large language models (LLMs) across various dimensions. In knowledge assessment, it consists of 38K+ multiple-choice questions with associated solution explanations. In application assessment, it features 16K+ test instances across distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation.
arXiv Detail & Related papers (2024-05-17T05:03:40Z)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment. We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human verified references. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judgecitezheng2023judging approach with Chain-of-Thought to generate explanations and final ratings.
arXiv Detail & Related papers (2023-11-30T17:41:30Z)
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models [17.562961249150295]
We propose the ZhuJiu benchmark for large language models (LLMs) evaluation. ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation abilities in English. The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/.
arXiv Detail & Related papers (2023-08-28T06:56:44Z)
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples. With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot. We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.