TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for
Human-Aligned LLMs
- URL: http://arxiv.org/abs/2311.05374v1
- Date: Thu, 9 Nov 2023 13:58:59 GMT
- Title: TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for
Human-Aligned LLMs
- Authors: Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin,
Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang,
Jing Nie, Yuhong Liu
- Abstract summary: Large language models (LLMs) have shown impressive capabilities across various natural language tasks.
We propose a comprehensive human evaluation framework to assess LLMs' proficiency in following instructions on diverse real-world tasks.
- Score: 35.717370285231176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown impressive capabilities across
various natural language tasks. However, evaluating their alignment with human
preferences remains a challenge. To this end, we propose a comprehensive human
evaluation framework to assess LLMs' proficiency in following instructions on
diverse real-world tasks. We construct a hierarchical task tree encompassing 7
major areas, over 200 categories, and over 800 tasks, covering diverse
capabilities such as question answering, reasoning, multi-turn dialogue, and
text generation, to evaluate LLMs in a comprehensive and in-depth manner.
We also design detailed evaluation standards and processes to facilitate
consistent, unbiased judgments from human evaluators. A test set of over 3,000
instances is released, spanning different difficulty levels and knowledge
domains. Our work provides a standardized methodology to evaluate human
alignment in LLMs for both English and Chinese. We also analyze the feasibility
of automating parts of evaluation with a strong LLM (GPT-4). Our framework
supports a thorough assessment of LLMs as they are integrated into real-world
applications. We have made publicly available the task tree, the TencentLLMEval
dataset, and the evaluation methodology, which have been demonstrated to be
effective in assessing the performance of Tencent Hunyuan LLMs. By doing so, we aim to
facilitate the benchmarking of advances in the development of safe and
human-aligned LLMs.
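The abstract describes two concrete mechanisms: a hierarchical task tree (7 major areas, 200+ categories, 800+ leaf tasks) used to organize test instances, and partially automated judging with a strong LLM such as GPT-4. The Python sketch below is a hypothetical illustration of how such a tree and a judging step could be wired together; `TaskNode`, `TestInstance`, `judge_fn`, and the rubric text are assumptions made for illustration, not the released TencentLLMEval code.

```python
# A minimal sketch (not the authors' released code) of a hierarchical task tree
# and an LLM-as-judge scoring step, under the assumptions stated above.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TaskNode:
    """One node in the task tree: a major area, a category, or a leaf task."""
    name: str
    children: Dict[str, "TaskNode"] = field(default_factory=dict)

    def add_path(self, path: List[str]) -> None:
        """Insert a path such as ["Reasoning", "Logical reasoning", "Syllogisms"]."""
        if not path:
            return
        head, *rest = path
        child = self.children.setdefault(head, TaskNode(head))
        child.add_path(rest)

    def leaf_tasks(self) -> List[str]:
        """Collect leaf task names, e.g. for coverage checks over the ~800 tasks."""
        if not self.children:
            return [self.name]
        return [t for c in self.children.values() for t in c.leaf_tasks()]


@dataclass
class TestInstance:
    """One of the 3,000+ released test instances (fields are illustrative)."""
    task_path: List[str]   # e.g. ["Question answering", "Open-domain QA", ...]
    prompt: str
    difficulty: str        # e.g. "easy" / "medium" / "hard"
    language: str          # "en" or "zh"


def judge_response(instance: TestInstance,
                   model_response: str,
                   judge_fn: Callable[[str], str]) -> str:
    """Score a model response with a strong LLM judge.

    `judge_fn` is a placeholder for whatever client call sends the prompt to
    GPT-4 or another judge model; the rubric below is a hypothetical stand-in
    for the paper's detailed evaluation standards.
    """
    rubric = (
        "You are an impartial evaluator. Given the task, the user prompt, and "
        "a model response, rate the response from 1 (poor) to 5 (excellent) "
        "for helpfulness, correctness, and adherence to the instruction."
    )
    judge_prompt = (
        f"{rubric}\n\nTask: {' > '.join(instance.task_path)}\n"
        f"Prompt: {instance.prompt}\nResponse: {model_response}\nRating:"
    )
    return judge_fn(judge_prompt)


# Example: build a tiny slice of the tree and list its leaf tasks.
root = TaskNode("root")
root.add_path(["Reasoning", "Logical reasoning", "Syllogisms"])
root.add_path(["Text generation", "Creative writing", "Story continuation"])
print(root.leaf_tasks())
```

In a full harness one would presumably load the released test set, route each instance to the models under evaluation, and aggregate judge (or human) scores per node of the tree so results can be reported at the area, category, or task level.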
Related papers
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experience.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation [16.73300162869746]
Large Language Models (LLMs) have made progress in various real-world tasks.
Existing evaluation methods are mainly based on supervised signals.
We propose a novel Deep Interaction-based LLM-evaluation framework.
arXiv Detail & Related papers (2023-09-08T15:00:41Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks are difficult to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, while new tasks can also be easily added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)