Holistic Evaluation of Language Models
- URL: http://arxiv.org/abs/2211.09110v2
- Date: Sun, 1 Oct 2023 21:44:23 GMT
- Title: Holistic Evaluation of Language Models
- Authors: Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara
Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya
Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian
Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew
A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu
Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert
Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar
Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani
Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang,
Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, Yuta
Koreeda
- Abstract summary: Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
- Score: 183.94891340168175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) are becoming the foundation for almost all major
language technologies, but their capabilities, limitations, and risks are not
well understood. We present Holistic Evaluation of Language Models (HELM) to
improve the transparency of language models. First, we taxonomize the vast
space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata)
that are of interest for LMs. Then we select a broad subset based on coverage
and feasibility, noting what's missing or underrepresented (e.g. question
answering for neglected English dialects, metrics for trustworthiness). Second,
we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration,
robustness, fairness, bias, toxicity, and efficiency) for each of 16 core
scenarios when possible (87.5% of the time). This ensures that metrics beyond
accuracy don't fall by the wayside and that trade-offs are clearly exposed. We
also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze
specific aspects (e.g. reasoning, disinformation). Third, we conduct a
large-scale evaluation of 30 prominent language models (spanning open,
limited-access, and closed models) on all 42 scenarios, 21 of which were not
previously used in mainstream LM evaluation. Prior to HELM, models on average
were evaluated on just 17.9% of the core HELM scenarios, with some prominent
models not sharing a single scenario in common. We improve this to 96.0%: now
all 30 models have been densely benchmarked on the same core scenarios and
metrics under standardized conditions. Our evaluation surfaces 25 top-level
findings. For full transparency, we release all raw model prompts and
completions publicly for further analysis, as well as a general modular
toolkit. We intend for HELM to be a living benchmark for the community,
continuously updated with new scenarios, metrics, and models.
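
A minimal sketch of the dense multi-metric grid described in the abstract, assuming nothing beyond the quoted numbers: 7 metrics × 16 core scenarios gives 112 scenario-metric cells, and 87.5% coverage corresponds to 98 measured cells. The scenario names and the set of unmeasurable cells below are hypothetical placeholders chosen only to reproduce that figure; they are not taken from the HELM release or its toolkit.

```python
# Hypothetical illustration of HELM-style dense coverage bookkeeping:
# 7 metrics x 16 core scenarios = 112 cells, 98 of which (87.5%) are measured.
from itertools import product

METRICS = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]                  # the 7 metrics
SCENARIOS = [f"scenario_{i:02d}" for i in range(16)]          # 16 placeholder core scenarios

# 14 invented (scenario, metric) cells treated as unmeasurable, e.g.
# toxicity/bias on scenarios without free-form generation.
UNMEASURED = {(f"scenario_{i:02d}", metric)
              for i in (3, 7, 9, 11, 12, 14, 15)
              for metric in ("toxicity", "bias")}

all_cells = list(product(SCENARIOS, METRICS))
measured = [cell for cell in all_cells if cell not in UNMEASURED]

coverage = len(measured) / len(all_cells)
print(f"{len(measured)}/{len(all_cells)} scenario-metric pairs measured "
      f"({coverage:.1%})")
# -> 98/112 scenario-metric pairs measured (87.5%)
```

The same counting applies along the model axis: densely filling the model × core-scenario matrix under standardized conditions is what moves the reported coverage from 17.9% before HELM to 96.0%.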
Related papers
- None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks [0.9831489366502301]
We introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts.
Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish.
Results show that all models experience remarkable accuracy drops, with an average loss of 57% on MMLU and 50% on UNED-Access 2024.
(arXiv, 2025-02-18)
- Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data.
Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios.
Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
(arXiv, 2024-12-12)
- Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments [0.7852714805965528]
We develop a set of 30 counterfactual scenarios and collect ratings across 8 evaluation metrics from 206 respondents.
We fine-tuned different Large Language Models to predict average or individual human judgment across these metrics.
(arXiv, 2024-10-28)
- MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models [3.961168847961322]
Large language models (LLMs) are commonly used as evaluators in tasks where they act as proxies for human preferences or judgments.
Existing benchmarks primarily focus on English, offering limited insight into LLMs' effectiveness as evaluators in non-English contexts.
We introduce MM-Eval, a multilingual meta-evaluation benchmark that covers 18 languages across six categories.
(arXiv, 2024-10-23)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
(arXiv, 2024-10-14)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
(arXiv, 2024-10-09)
- STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions [6.19084217044276]
Mitigating explicit and implicit biases in Large Language Models (LLMs) has become a critical focus in the field of natural language processing.
We introduce the Sensitivity Testing on Offensive Progressions dataset, which includes 450 offensive progressions containing 2,700 unique sentences.
Our findings reveal that even the best-performing models detect bias inconsistently, with success rates ranging from 19.3% to 69.8%.
(arXiv, 2024-09-20)
- The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
(arXiv, 2024-06-09)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
(arXiv, 2023-10-09)
This list is automatically generated from the titles and abstracts of the papers in this site.