Evaluating the Generation Capabilities of Large Chinese Language Models
- URL: http://arxiv.org/abs/2308.04823v4
- Date: Tue, 30 Jan 2024 00:00:57 GMT
- Title: Evaluating the Generation Capabilities of Large Chinese Language Models
- Authors: Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang
- Abstract summary: This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework.
It assesses the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Gscore automates the quality measurement of a model's text generation against reference standards.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper unveils CG-Eval, the first-ever comprehensive and automated
evaluation framework designed for assessing the generative capabilities of
large Chinese language models across a spectrum of academic disciplines.
CG-Eval stands out for its automated process, which critically assesses models
based on their proficiency in generating precise and contextually relevant
responses to a diverse array of questions within six key domains: Science and
Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical
Practitioner Qualification Examination, Judicial Examination, and Certified
Public Accountant Examination. Alongside this, we introduce Gscore, an
innovative composite index developed from a weighted sum of multiple metrics.
Gscore uniquely automates the quality measurement of a model's text generation
against reference standards, providing a detailed and nuanced assessment of
model performance. This automation not only enhances the efficiency and
scalability of the evaluation process but also ensures objective and consistent
assessment across various models. The detailed test data and results,
highlighting the robust capabilities and comparative performance of the
evaluated models, are accessible at http://cgeval.besteasy.com/.
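The abstract describes Gscore only as a composite index formed from a weighted sum of multiple metrics; the exact components and weights are not given here. A minimal sketch of such a composite, with illustrative metric names and weights (all assumptions, not CG-Eval's actual configuration):

```python
def gscore(metric_scores, weights):
    """Combine per-metric scores (each in [0, 1]) into a single
    weighted-sum composite, in the spirit of CG-Eval's Gscore.
    Metric names and weights here are illustrative placeholders,
    not the paper's actual configuration."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metric_scores[name] for name in weights) / total_weight

# Hypothetical component scores for one model response:
scores = {"rouge_l": 0.62, "bleu": 0.48, "semantic_sim": 0.81}
weights = {"rouge_l": 0.4, "bleu": 0.2, "semantic_sim": 0.4}
print(round(gscore(scores, weights), 3))  # → 0.668
```

A weighted sum keeps the composite interpretable: each component's contribution can be read off directly, and re-weighting does not require re-running the underlying metrics.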
Related papers
- Quantitative Assurance and Synthesis of Controllers from Activity Diagrams [4.419843514606336]
Probabilistic model checking is a widely used formal verification technique for automatically verifying qualitative and quantitative properties.
However, it requires specialist knowledge, which makes it inaccessible to researchers and engineers without that background.
We propose a comprehensive verification framework for ADs, including a new profile for probability time, quality annotations, a semantic interpretation of ADs in three Markov models, and a set of transformation rules from activity diagrams to the PRISM language.
Most importantly, we developed algorithms for transformation and implemented them in a tool, called QASCAD, using model-based techniques, for fully automated verification.
arXiv Detail & Related papers (2024-02-29T22:40:39Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
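The entry above gives no formulas; as a hedged sketch, distribution-level precision and recall over generated text can be estimated with nearest-neighbor membership tests between generated and reference samples in some embedding space. The 2-D points and radius below are placeholders for that sketch, not the paper's actual method:

```python
import math

def precision_recall(generated, reference, radius):
    """Set-level precision/recall: a generated point counts toward
    precision if it lies within `radius` of any reference point;
    recall is the symmetric check on reference points. Embeddings
    and radius are illustrative assumptions."""
    prec = sum(any(math.dist(g, r) <= radius for r in reference) for g in generated) / len(generated)
    rec = sum(any(math.dist(r, g) <= radius for g in generated) for r in reference) / len(reference)
    return prec, rec

# Toy 2-D "embeddings" of generated and reference samples:
gen = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
ref = [(0.1, 0.0), (1.0, 0.9)]
print(precision_recall(gen, ref, radius=0.5))  # → (0.6666666666666666, 1.0)
```

Under this reading, low precision signals off-distribution generations, while low recall signals missing diversity relative to the reference set.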
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- SelfEval: Leveraging the discriminative nature of generative models for evaluation [35.7242199928684]
We show that text-to-image generative models can be 'inverted' to assess their own text-image understanding capabilities.
Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts.
arXiv Detail & Related papers (2023-11-17T18:58:16Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, may not adequately capture these dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25% of the text) with content unrelated to the topic of the questions do not decrease the scores produced by the models.
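The adversarial scheme summarized above can be illustrated with a toy perturbation test: append off-topic filler worth up to 25% of the essay's length and check whether the score moves. The scorer below is a deliberately overstable stand-in, not a real AES model:

```python
def perturb(essay, filler, fraction=0.25):
    """Append topic-unrelated filler worth up to `fraction` of the
    essay's word count -- the kind of modification used to probe
    overstability in AES models."""
    words = essay.split()
    budget = max(1, int(len(words) * fraction))
    return " ".join(words + filler.split()[:budget])

def dummy_scorer(essay):
    # Placeholder scorer: rates purely on length (capped at 10),
    # so it is trivially overstable under irrelevant additions.
    return min(10, len(essay.split()) // 5)

essay = " ".join(["word"] * 100)  # 100-word toy essay
modified = perturb(essay, " ".join(["irrelevant"] * 50))
print(dummy_scorer(essay), dummy_scorer(modified))  # → 10 10
```

An overstable model, like this stand-in, returns the same score before and after the irrelevant modification; a robust one would penalize the off-topic content.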
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.