Advancing the Evaluation of Traditional Chinese Language Models: Towards
a Comprehensive Benchmark Suite
- URL: http://arxiv.org/abs/2309.08448v2
- Date: Mon, 2 Oct 2023 15:22:42 GMT
- Title: Advancing the Evaluation of Traditional Chinese Language Models: Towards
a Comprehensive Benchmark Suite
- Authors: Chan-Jan Hsu, Chang-Le Liu, Feng-Ting Liao, Po-Chun Hsu, Yi-Chang
Chen, Da-shan Shiu
- Abstract summary: We propose a novel set of benchmarks that leverage existing English datasets and are tailored to evaluate language models in Traditional Chinese.
These benchmarks encompass a wide range of tasks, including contextual question-answering, summarization, classification, and table understanding.
In this paper, we evaluate the performance of GPT-3.5, Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
- Score: 17.764840326809797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of large language models is an essential task in the field of
language understanding and generation. As language models continue to advance,
the need for effective benchmarks to assess their performance has become
imperative. In the context of Traditional Chinese, there is a scarcity of
comprehensive and diverse benchmarks to evaluate the capabilities of language
models, despite the existence of benchmarks such as the DRCD, TTQA, CMDQA, and
FGC datasets. To address this gap, we propose a novel set of benchmarks that
leverage existing English datasets and are tailored to evaluate language models
in Traditional Chinese. These benchmarks encompass a wide range of tasks,
including contextual question-answering, summarization, classification, and
table understanding. The proposed benchmarks offer a comprehensive evaluation
framework, enabling the assessment of language models' capabilities across
different tasks. In this paper, we evaluate the performance of GPT-3.5,
Taiwan-LLaMa-v1.0, and Model 7-C, our proprietary model, on these benchmarks.
The evaluation results show that our model, Model 7-C, achieves performance
comparable to GPT-3.5 on a subset of the evaluated capabilities. In an effort
to advance the evaluation of language models in
Traditional Chinese and stimulate further research in this field, we have
open-sourced our benchmark and opened the model for trial.
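The benchmark groups heterogeneous tasks (contextual question-answering, summarization, classification, and table understanding) under a single evaluation framework. As a rough illustration of how such a multi-task harness could be driven, the sketch below scores a generic text-generation model per task; the task names, example items, and the model_generate stub are hypothetical placeholders, not the authors' released code or data.

```python
# Minimal sketch of a multi-task evaluation loop for a Traditional Chinese
# benchmark suite. All task names, examples, and the model_generate stub are
# hypothetical placeholders, not the released benchmark itself.
from collections import defaultdict

def model_generate(prompt: str) -> str:
    """Stand-in for a real LM call (e.g. GPT-3.5 or Taiwan-LLaMa via an API)."""
    return "示例回答"  # placeholder output

# Hypothetical benchmark items: each task pairs a prompt with a reference answer.
BENCHMARK = {
    "contextual_qa": [
        {"prompt": "根據下文回答:台北是哪一國的城市?\n(文章略)", "reference": "台灣"},
    ],
    "classification": [
        {"prompt": "判斷情感(正面/負面):這部電影太好看了!", "reference": "正面"},
    ],
}

def exact_match(prediction: str, reference: str) -> float:
    """Crude exact-match metric; real tasks would use ROUGE, F1, etc."""
    return float(prediction.strip() == reference.strip())

def evaluate() -> dict:
    scores = defaultdict(list)
    for task, items in BENCHMARK.items():
        for item in items:
            pred = model_generate(item["prompt"])
            scores[task].append(exact_match(pred, item["reference"]))
    # Average per task so capabilities can be compared across models.
    return {task: sum(vals) / len(vals) for task, vals in scores.items()}

if __name__ == "__main__":
    for task, score in evaluate().items():
        print(f"{task}: {score:.2f}")
```

A per-task breakdown of this kind is what allows the claim that a model matches GPT-3.5 on some capabilities but not others.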
Related papers
- Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks [3.773596042872403]
As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount.
Various frameworks have emerged as noteworthy contributions to the field, offering comprehensive evaluation tests and benchmarks.
This paper provides an exploration and critical analysis of some of these evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the state-of-the-art in natural language processing.
arXiv Detail & Related papers (2024-07-29T03:37:14Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Construction of a Japanese Financial Benchmark for Large Language Models [0.7329727526222747]
The results show that GPT-4 is currently outstanding and that the constructed benchmarks function effectively.
Our benchmark can differentiate benchmark scores among models in all performance ranges by combining tasks with different difficulties.
arXiv Detail & Related papers (2024-03-22T09:40:27Z) - Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy [27.454549324141087]
We propose a novel VQA benchmark based on well-known visual classification datasets.
We also suggest using the semantic hierarchy of the label space to ask automatically generated follow-up questions about the ground-truth category.
Our contributions aim to lay the foundation for more precise and meaningful assessments.
arXiv Detail & Related papers (2024-02-11T18:26:18Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples.
With human evaluation, we obtain the converted total scores of LLMs, including GPT-4, ChatGPT, and ERNIE-Bot.
We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z) - Towards Better Instruction Following Language Models for Chinese:
Investigating the Impact of Training Data and Evaluation [12.86275938443485]
We examine the influence of training data factors, including quantity, quality, and linguistic distribution, on model performance.
We assess various models using an evaluation set of 1,000 samples, encompassing nine real-world scenarios.
We extend the vocabulary of LLaMA, the open-source model whose performance is closest to proprietary language models like GPT-3.
arXiv Detail & Related papers (2023-04-16T18:37:39Z) - ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented
Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z) - CUGE: A Chinese Language Understanding and Generation Evaluation
Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)