Evaluating Large Language Models: A Comprehensive Survey
- URL: http://arxiv.org/abs/2310.19736v3
- Date: Sat, 25 Nov 2023 17:35:12 GMT
- Title: Evaluating Large Language Models: A Comprehensive Survey
- Authors: Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Supryadi,
Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong
- Abstract summary: Large language models (LLMs) have demonstrated remarkable capabilities across a broad spectrum of tasks.
They could suffer from private data leaks or yield inappropriate, harmful, or misleading content.
To effectively capitalize on LLM capacities as well as ensure their safe and beneficial development, it is critical to conduct a rigorous and comprehensive evaluation.
- Score: 41.64914110226901
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across
a broad spectrum of tasks. They have attracted significant attention and been
deployed in numerous downstream applications. Nevertheless, akin to a
double-edged sword, LLMs also present potential risks. They could suffer from
private data leaks or yield inappropriate, harmful, or misleading content.
Additionally, the rapid progress of LLMs raises concerns about the potential
emergence of superintelligent systems without adequate safeguards. To
effectively capitalize on LLM capacities as well as ensure their safe and
beneficial development, it is critical to conduct a rigorous and comprehensive
evaluation of LLMs.
This survey endeavors to offer a panoramic perspective on the evaluation of
LLMs. We categorize the evaluation of LLMs into three major groups: knowledge
and capability evaluation, alignment evaluation, and safety evaluation. In
addition to a comprehensive review of the evaluation methodologies and
benchmarks for these three aspects, we collate a compendium of evaluations
pertaining to LLMs' performance in specialized domains, and discuss the
construction of comprehensive evaluation platforms that cover LLM evaluations
on capabilities, alignment, safety, and applicability.
We hope that this comprehensive overview will stimulate further research
interest in the evaluation of LLMs, with the ultimate goal of making
evaluation serve as a cornerstone in guiding the responsible development of
LLMs. We envision that this will channel their evolution into a direction that
maximizes societal benefit while minimizing potential risks. A curated list of
related papers is publicly available at
https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers.
Related papers
- Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
- A Survey of Useful LLM Evaluation [20.048914787813263]
Two-stage framework: from "core ability" to "agent".
In the "core ability" stage, we discussed the reasoning ability, societal impact, and domain knowledge of LLMs.
In the "agent" stage, we demonstrated embodied action, planning, and tool learning of LLM agent applications.
arXiv Detail & Related papers (2024-06-03T02:20:03Z)
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- Multitask-based Evaluation of Open-Source LLM on Software Vulnerability [2.7692028382314815]
This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets.
We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks.
We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection.
arXiv Detail & Related papers (2024-04-02T15:52:05Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. (See the illustrative sketch after this list.)
arXiv Detail & Related papers (2024-01-23T14:29:17Z)
- Through the Lens of Core Competency: Survey on Evaluation of Large Language Models [27.271533306818732]
Large language models (LLMs) have excellent performance and wide practical uses.
Existing evaluation tasks struggle to keep up with the wide range of applications in real-world scenarios.
We summarize 4 core competencies of LLM, including reasoning, knowledge, reliability, and safety.
Under this competency architecture, similar tasks are combined to reflect the corresponding ability, and new tasks can easily be added to the system.
arXiv Detail & Related papers (2023-08-15T17:40:34Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, which includes 100k augmented prompts and responses generated by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
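The "Benchmarking LLMs via Uncertainty Quantification" entry above does not spell out its method here; purely as an illustration, the sketch below shows one common way to score answer uncertainty on a multiple-choice benchmark item, using predictive entropy over option probabilities. The function name and the per-option log-probabilities are hypothetical and not taken from that paper.

```python
# Illustrative sketch (not the cited paper's method): quantify an LLM's
# uncertainty on a multiple-choice item via predictive entropy.
import math

def predictive_entropy(option_logprobs):
    """Return the entropy (in nats) of the normalized distribution over options."""
    # Convert log-probabilities to a probability distribution (stable softmax).
    max_lp = max(option_logprobs)
    exps = [math.exp(lp - max_lp) for lp in option_logprobs]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Shannon entropy: higher values indicate greater uncertainty.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

if __name__ == "__main__":
    # Hypothetical per-option log-probabilities from a model for options A-D.
    confident = [-0.1, -3.2, -4.0, -4.5]   # mass concentrated on one option
    uncertain = [-1.4, -1.4, -1.3, -1.5]   # mass spread across options
    print(f"confident item entropy: {predictive_entropy(confident):.3f} nats")
    print(f"uncertain item entropy: {predictive_entropy(uncertain):.3f} nats")
```

In a benchmark run, a per-item score like this could be averaged alongside accuracy, which is the kind of accuracy-versus-certainty comparison the findings above describe.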