AlignBench: Benchmarking Chinese Alignment of Large Language Models
- URL: http://arxiv.org/abs/2311.18743v3
- Date: Tue, 5 Dec 2023 16:04:15 GMT
- Title: AlignBench: Benchmarking Chinese Alignment of Large Language Models
- Authors: Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi
Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun,
Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao Dong, Jie Tang
- Abstract summary: We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
Our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with Chain-of-Thought to generate explanations and final ratings as evaluations.
We report AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that recovers 95% of GPT-4's evaluation ability.
- Score: 100.30878214336444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment has become a critical step for instruction-tuned Large Language
Models (LLMs) to become helpful assistants. However, effective evaluation of
alignment for emerging Chinese LLMs is still significantly lacking, calling for
real-scenario grounded, open-ended, challenging and automatic evaluations
tailored for alignment. To fill in this gap, we introduce AlignBench, a
comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in
Chinese. Equipped with a human-in-the-loop data curation pipeline, our
benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge with
Chain-of-Thought to generate explanations and final ratings as evaluations,
ensuring high reliability and interpretability. Furthermore, we report
AlignBench evaluated by CritiqueLLM, a dedicated Chinese evaluator LLM that
recovers 95% of GPT-4's evaluation ability. We will provide public APIs for
evaluating AlignBench with CritiqueLLM to facilitate the evaluation of LLMs'
Chinese alignment. All evaluation codes, data, and LLM generations are
available at \url{https://github.com/THUDM/AlignBench}.
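As a rough illustration of the point-wise, rule-calibrated LLM-as-Judge protocol described in the abstract, the sketch below builds a Chain-of-Thought judging prompt over a few rating dimensions and parses a final 1-10 rating from the judge's critique. The dimension names, prompt wording, rating scale, and the stub judge callable are illustrative assumptions rather than AlignBench's exact implementation; the official pipeline is at https://github.com/THUDM/AlignBench.

```python
# Minimal sketch of a rule-calibrated, multi-dimensional LLM-as-Judge pass with
# Chain-of-Thought, in the spirit of AlignBench's protocol. Dimensions, prompt
# wording, and the 1-10 scale are assumptions, not the benchmark's exact setup.
import re
from typing import Callable, Dict

# Assumed rating dimensions for illustration only.
DIMENSIONS = ["factual correctness", "user satisfaction", "clarity", "completeness"]

def build_judge_prompt(question: str, reference: str, answer: str) -> str:
    """Assemble a CoT judging prompt calibrated against a reference answer."""
    dims = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        "You are an impartial judge of a Chinese LLM assistant's answer.\n"
        f"Question:\n{question}\n\nReference answer:\n{reference}\n\n"
        f"Assistant answer:\n{answer}\n\n"
        "First reason step by step about the answer along these dimensions:\n"
        f"{dims}\n"
        "Then, calibrated against the reference, give one overall rating.\n"
        "End with a line of the form: Rating: [[x]], where x is an integer from 1 to 10."
    )

def judge_answer(question: str, reference: str, answer: str,
                 judge: Callable[[str], str]) -> Dict[str, object]:
    """Send the prompt to a judge model and parse the final rating from its critique."""
    critique = judge(build_judge_prompt(question, reference, answer))
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", critique)
    return {"critique": critique, "rating": int(match.group(1)) if match else None}

if __name__ == "__main__":
    # Stub judge so the sketch runs without an API key; replace with a real model call.
    stub = lambda prompt: "The answer is accurate and complete.\nRating: [[8]]"
    print(judge_answer("什么是对齐？", "对齐指让模型行为符合人类意图。",
                       "对齐是让模型的行为符合人类意图和价值观。", stub))
```

In practice the stub judge would be replaced by a call to GPT-4 or CritiqueLLM through the public evaluation APIs mentioned above.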
Related papers
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
- TICKing All the Boxes: Generated Checklists Improve LLM Evaluation and Generation [24.954629877691623]
TICK (Targeted Instruct-evaluation with ChecKlists) is a fully automated, interpretable evaluation protocol.
We first show that, given an instruction, LLMs can reliably produce high-quality, tailored evaluation checklists.
We then show that Self-TICK (STICK) can be used to improve generation quality across multiple benchmarks via self-refinement and Best-of-N selection.
arXiv Detail & Related papers (2024-10-04T17:09:08Z)
- What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve machine translation evaluation results comparable to those of fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z)
- DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation [48.11754113512047]
This study includes a code generation benchmark dataset DOMAINEVAL, encompassing six popular domains.
Our pipeline works in a fully automated manner, enabling push-button construction of formatted evaluation subjects from code repositories.
The contributions of this study include the DOMAINEVAL benchmark, the fully automated pipeline for constructing code benchmarks, and an identification of the limitations of LLMs in code generation tasks based on their performance on DOMAINEVAL.
arXiv Detail & Related papers (2024-08-23T16:33:58Z)
- OpenEval: Benchmarking Chinese LLMs across Capability, Alignment and Safety [37.07970624135514]
OpenEval is an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety.
For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs from 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning and mathematical reasoning.
For alignment assessment, OpenEval contains 7 datasets that examine bias, offensiveness, and illegality in the outputs of Chinese LLMs.
arXiv Detail & Related papers (2024-03-18T23:21:37Z)
- SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames.
It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values.
Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z)
- FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraint-following benchmark for large language models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.