Evaluating the Generation Capabilities of Large Chinese Language Models
- URL: http://arxiv.org/abs/2308.04823v4
- Date: Tue, 30 Jan 2024 00:00:57 GMT
- Title: Evaluating the Generation Capabilities of Large Chinese Language Models
- Authors: Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang
- Abstract summary: This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework.
It assesses the generative capabilities of large Chinese language models across a spectrum of academic disciplines.
Gscore automates the quality measurement of a model's text generation against reference standards.
- Score: 27.598864484231477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper unveils CG-Eval, the first-ever comprehensive and automated
evaluation framework designed for assessing the generative capabilities of
large Chinese language models across a spectrum of academic disciplines.
CG-Eval stands out for its automated process, which critically assesses models
based on their proficiency in generating precise and contextually relevant
responses to a diverse array of questions within six key domains: Science and
Engineering, Humanities and Social Sciences, Mathematical Calculations, Medical
Practitioner Qualification Examination, Judicial Examination, and Certified
Public Accountant Examination. Alongside this, we introduce Gscore, an
innovative composite index developed from a weighted sum of multiple metrics.
Gscore uniquely automates the quality measurement of a model's text generation
against reference standards, providing a detailed and nuanced assessment of
model performance. This automation not only enhances the efficiency and
scalability of the evaluation process but also ensures objective and consistent
assessment across various models. The detailed test data and results,
highlighting the robust capabilities and comparative performance of the
evaluated models, are accessible at http://cgeval.besteasy.com/.
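The abstract describes Gscore only at a high level: a composite index formed as a weighted sum of multiple metrics scored against reference text. The sketch below illustrates that general shape with simple stand-in metrics; the actual metrics, weights, and function names used by CG-Eval are not published here and are assumed purely for illustration.

```python
# A minimal sketch (not the authors' implementation) of a Gscore-style
# composite: a weighted sum of per-sample similarity metrics computed
# against a reference answer. Metric choices and weights are illustrative.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(candidate, reference, n):
    """Clipped n-gram precision of the candidate against the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    if not cand:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

def composite_score(candidate, reference, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of unigram, bigram, and length-ratio scores in [0, 1]."""
    cand_tok, ref_tok = candidate.split(), reference.split()
    metrics = [
        ngram_overlap(cand_tok, ref_tok, 1),
        ngram_overlap(cand_tok, ref_tok, 2),
        min(len(cand_tok), len(ref_tok)) / max(len(cand_tok), len(ref_tok), 1),
    ]
    return sum(w * m for w, m in zip(weights, metrics))

print(composite_score("the law requires written consent",
                      "the law requires prior written consent"))
```

Because each component is normalized to [0, 1] and the weights sum to one, the composite stays in [0, 1] and can be compared across models and question types, which is the property a weighted-sum index of this kind relies on.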
Related papers
- Automated Genre-Aware Article Scoring and Feedback Using Large Language Models [8.10826723408637]
This paper focuses on the development of an advanced intelligent article scoring system.
It assesses the overall quality of written work and offers detailed feature-based scoring tailored to various article genres.
arXiv Detail & Related papers (2024-10-18T04:13:51Z)
- An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation [29.81362106367831]
Existing evaluation methods often suffer from high costs, limited test formats, the need for human references, and systematic evaluation biases.
In contrast to previous studies that rely on human annotations, Auto-PRE selects evaluators automatically based on their inherent traits.
Experimental results indicate our Auto-PRE achieves state-of-the-art performance at a lower cost.
arXiv Detail & Related papers (2024-10-16T06:06:06Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores [23.568883428947494]
We investigate whether prominent LM-based evaluation metrics demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks.
Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries.
These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality.
arXiv Detail & Related papers (2023-11-16T10:43:26Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)