AlignBench: Benchmarking Chinese Alignment of Large Language Models
- URL: http://arxiv.org/abs/2311.18743v3
- Date: Tue, 5 Dec 2023 16:04:15 GMT
- Title: AlignBench: Benchmarking Chinese Alignment of Large Language Models
- Authors: Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi
Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun,
Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, Jie Tang
- Abstract summary: We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
Our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations.
We report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability.
- Score: 100.30878214336444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment has become a critical step for instruction-tuned Large Language
Models (LLMs) to become helpful assistants. However, effective evaluation of
alignment for emerging Chinese LLMs is still significantly lacking, calling for
real-scenario grounded, open-ended, challenging and automatic evaluations
tailored for alignment. To fill in this gap, we introduce AlignBench, a
comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in
Chinese. Equipped with a human-in-the-loop data curation pipeline, our
benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with
Chain-of-Thought to generate explanations and final ratings as evaluations,
ensuring high reliability and interpretability. Furthermore, we report
AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that
recovers 95% of GPT-4's evaluation ability. We will provide public APIs for
evaluating AlignBench with CritiqueLLM to facilitate the evaluation of LLMs'
Chinese alignment. All evaluation codes, data, and LLM generations are
available at https://github.com/THUDM/AlignBench.
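As a rough illustration of the multi-dimensional LLM-as-Judge idea described in the abstract, the sketch below parses a judge reply into an explanation, per-dimension scores, and a final rating. The reply format, the dimension names, and the `parse_judgment` helper are all assumptions made for this example; they are not taken from the AlignBench code or prompts.

```python
import re
from dataclasses import dataclass

# Hypothetical reply format (an assumption, not the AlignBench format):
# free-text reasoning, then lines like "Factuality: 8", then "Final rating: 7".
SCORE_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z ]*?)\s*:\s*(\d{1,2})\s*$")


@dataclass
class Judgment:
    explanation: str
    dimension_scores: dict[str, int]
    final_rating: int


def parse_judgment(reply: str) -> Judgment:
    """Split a judge reply into explanation, dimension scores, and final rating."""
    explanation_lines = []
    scores = {}
    for line in reply.splitlines():
        match = SCORE_LINE.match(line)
        if match:
            # A "Name: number" line is treated as a score.
            scores[match.group(1)] = int(match.group(2))
        elif line.strip():
            # Any other non-empty line is treated as part of the explanation.
            explanation_lines.append(line.strip())
    if "Final rating" not in scores:
        raise ValueError("judge reply has no 'Final rating' line")
    final_rating = scores.pop("Final rating")
    return Judgment(" ".join(explanation_lines), scores, final_rating)


# Made-up reply used only to exercise the parser.
example_reply = """The answer is mostly correct but misses one user constraint.
Factuality: 8
User satisfaction: 6
Final rating: 7"""

judgment = parse_judgment(example_reply)
print(judgment.final_rating, judgment.dimension_scores)
```

Keeping the explanation next to the scores is what makes such a rating inspectable: a reader can check whether the stated reasoning supports each dimension score.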
Related papers
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety [37.07970624135514]
OpenEval is an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety.
For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs from 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning and mathematical reasoning.
For alignment assessment, OpenEval contains 7 datasets that examine the bias, offensiveness and illegalness in the outputs yielded by Chinese LLMs.
arXiv Detail & Related papers (2024-03-18T23:21:37Z)
- PiCO: Peer Review in LLMs based on the Consistency Optimization [19.130941716491716]
We use peer-review mechanisms to measure large language models (LLMs) automatically.
We formalize it as a constrained optimization problem, intending to maximize the consistency of each LLM's capabilities and scores.
We propose three metrics called PEN, CIN, and LIS to evaluate the gap in alignment with human rankings.
arXiv Detail & Related papers (2024-02-02T18:49:26Z)
- Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames.
It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values.
Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models [17.562961249150295]
We propose the ZhuJiu benchmark for large language models (LLMs) evaluation.
ZhuJiu is the pioneering benchmark that fully assesses LLMs in Chinese, while also providing equally robust evaluation abilities in English.
The ZhuJiu benchmark and open-participation leaderboard are publicly released at http://www.zhujiu-benchmark.com/.
arXiv Detail & Related papers (2023-08-28T06:56:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.