Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
- URL: http://arxiv.org/abs/2409.00844v1
- Date: Sun, 1 Sep 2024 21:18:14 GMT
- Title: Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
- Authors: Blair Yang, Fuyang Cui, Keiran Paster, Jimmy Ba, Pashootan Vaezipoor, Silviu Pitis, Michael R. Zhang,
- Abstract summary: We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics.
We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans).
- Score: 33.39343288446156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid development and dynamic nature of large language models (LLMs) make it difficult for conventional quantitative benchmarks to accurately assess their capabilities. We propose report cards, which are human-interpretable, natural language summaries of model behavior for specific skills or topics. We develop a framework to evaluate report cards based on three criteria: specificity (ability to distinguish between models), faithfulness (accurate representation of model capabilities), and interpretability (clarity and relevance to humans). We also propose an iterative algorithm for generating report cards without human supervision and explore its efficacy by ablating various design choices. Through experimentation with popular LLMs, we demonstrate that report cards provide insights beyond traditional benchmarks and can help address the need for a more interpretable and holistic evaluation of LLMs.
Related papers
- Towards More Effective Table-to-Text Generation: Assessing In-Context Learning and Self-Evaluation with Open-Source Models [0.0]
This study explores the effectiveness of various in-context learning strategies in language models (LMs) across benchmark datasets.
We employ a large language model (LLM) self-evaluation approach using chain-of-thought reasoning and assess its correlation with human-aligned metrics like BERTScore.
Our findings highlight the significant impact of examples in improving table-to-text generation and suggest that, while LLM self-evaluation has potential, its current alignment with human judgment could be enhanced.
arXiv Detail & Related papers (2024-10-15T09:19:42Z) - An LLM Feature-based Framework for Dialogue Constructiveness Assessment [8.87747076871578]
Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructiveness outcomes following dialogues for such use cases.
These objectives can be achieved by training either interpretable feature-based models or neural models such as pre-trained language models.
We propose an LLM feature-based framework for dialogue constructiveness assessment that combines the strengths of feature-based and neural approaches.
arXiv Detail & Related papers (2024-06-20T22:10:52Z) - Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning [53.241569810013836]
We propose a new framework based on large language models (LLMs) and decision Tree reasoning (OCTree)
Our key idea is to leverage LLMs' reasoning capabilities to find good feature generation rules without manually specifying the search space.
Our empirical results demonstrate that this simple framework consistently enhances the performance of various prediction models.
arXiv Detail & Related papers (2024-06-12T08:31:34Z) - Lessons from the Trenches on Reproducible Evaluation of Language Models [60.522749986793094]
We draw on three years of experience in evaluating large language models to provide guidance and lessons for researchers.
We present the Language Model Evaluation Harness (lm-eval), an open source library for independent, reproducible, and evaluation of language models.
arXiv Detail & Related papers (2024-05-23T16:50:49Z) - A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks [30.54635848057259]
This paper conducts a comprehensive evaluation of well-known and high-performing large language models (LLMs)
We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization.
Our study reports both automatic results, accompanied by a detailed analysis.
arXiv Detail & Related papers (2024-05-16T16:56:54Z) - PRobELM: Plausibility Ranking Evaluation for Language Models [12.057770969325453]
PRobELM is a benchmark designed to assess language models' ability to discern more plausible scenarios through their parametric knowledge.
Our benchmark is constructed from a dataset curated from Wikidata edit histories, tailored to align the temporal bounds of the training data for the evaluated models.
arXiv Detail & Related papers (2024-04-04T21:57:11Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z) - Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding [1.827510863075184]
Curriculum is a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena.
We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
arXiv Detail & Related papers (2022-04-13T10:32:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.