Dimensions of Generative AI Evaluation Design
- URL: http://arxiv.org/abs/2411.12709v1
- Date: Tue, 19 Nov 2024 18:25:30 GMT
- Title: Dimensions of Generative AI Evaluation Design
- Authors: P. Alex Dow, Jennifer Wortman Vaughan, Solon Barocas, Chad Atalla, Alexandra Chouldechova, Hanna Wallach
- Abstract summary: We propose a set of general dimensions that capture critical choices involved in GenAI evaluation design.
These dimensions include the evaluation setting, the task type, the input source, the interaction style, the duration, the metric type, and the scoring method.
- Score: 51.541816010127256
- Abstract: There are few principles or guidelines to ensure evaluations of generative AI (GenAI) models and systems are effective. To help address this gap, we propose a set of general dimensions that capture critical choices involved in GenAI evaluation design. These dimensions include the evaluation setting, the task type, the input source, the interaction style, the duration, the metric type, and the scoring method. By situating GenAI evaluations within these dimensions, we aim to guide decision-making during GenAI evaluation design and provide a structure for comparing different evaluations. We illustrate the utility of the proposed set of general dimensions using two examples: a hypothetical evaluation of the fairness of a GenAI system and three real-world GenAI evaluations of biological threats.
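To make the proposed dimensions concrete, below is a minimal, purely illustrative Python sketch of how a single evaluation could be situated along them. The dimension names come from the abstract, but all class names, field names, and enum values are hypothetical assumptions, not an API or taxonomy defined in the paper.

```python
# Illustrative sketch only: hypothetical names and values, not the paper's API.
from dataclasses import dataclass
from enum import Enum


class Setting(Enum):
    MODEL = "model"        # evaluating a GenAI model in isolation (assumed value)
    SYSTEM = "system"      # evaluating a deployed GenAI system (assumed value)


class TaskType(Enum):
    OPEN_ENDED = "open-ended"
    CONSTRAINED = "constrained"


class InputSource(Enum):
    BENCHMARK = "benchmark"
    USER_GENERATED = "user-generated"


class InteractionStyle(Enum):
    SINGLE_TURN = "single-turn"
    MULTI_TURN = "multi-turn"


class MetricType(Enum):
    HUMAN_JUDGMENT = "human judgment"
    AUTOMATIC = "automatic"


@dataclass
class EvaluationDesign:
    """One evaluation, described along the general dimensions named in the abstract."""
    setting: Setting
    task_type: TaskType
    input_source: InputSource
    interaction_style: InteractionStyle
    duration: str          # e.g. "single session", "two weeks"
    metric_type: MetricType
    scoring_method: str    # e.g. "pass/fail", "5-point Likert scale"


# Hypothetical example: a fairness evaluation of a deployed GenAI system.
fairness_eval = EvaluationDesign(
    setting=Setting.SYSTEM,
    task_type=TaskType.OPEN_ENDED,
    input_source=InputSource.BENCHMARK,
    interaction_style=InteractionStyle.SINGLE_TURN,
    duration="single session",
    metric_type=MetricType.HUMAN_JUDGMENT,
    scoring_method="5-point Likert scale",
)
print(fairness_eval)
```

Recording each evaluation as such a structured record would make it straightforward to compare evaluations dimension by dimension, which is one of the stated aims of the paper.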
Related papers
- Position: Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We argue that the ML community would benefit from learning from and drawing on the social sciences when developing measurement instruments for evaluating GenAI systems.
We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI.
arXiv Detail & Related papers (2025-02-01T21:09:51Z) - A Shared Standard for Valid Measurement of Generative AI Systems' Capabilities, Risks, and Impacts [38.66213773948168]
The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems.
We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing.
arXiv Detail & Related papers (2024-12-02T19:50:00Z) - Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
arXiv Detail & Related papers (2024-11-17T02:35:30Z) - GenLens: A Systematic Evaluation of Visual GenAI Model Outputs [33.93591473459988]
GenLens is a visual analytic interface designed for the systematic evaluation of GenAI model outputs.
A user study with model developers reveals that GenLens effectively enhances their workflow, evidenced by high satisfaction rates.
arXiv Detail & Related papers (2024-02-06T04:41:06Z) - How much informative is your XAI? A decision-making assessment task to
objectively measure the goodness of explanations [53.01494092422942]
The number and complexity of personalised and user-centred approaches to XAI have rapidly grown in recent years.
User-centred approaches to XAI have been shown to positively affect the interaction between users and systems.
We propose an assessment task to objectively and quantitatively measure the goodness of XAI systems.
arXiv Detail & Related papers (2023-12-07T15:49:39Z) - Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z) - A System's Approach Taxonomy for User-Centred XAI: A Survey [0.6882042556551609]
We propose a unified, inclusive and user-centred taxonomy for XAI based on the principles of General System's Theory.
This provides a basis for evaluating the appropriateness of XAI approaches for all user types, including both developers and end users.
arXiv Detail & Related papers (2023-03-06T00:50:23Z) - Connecting Algorithmic Research and Usage Contexts: A Perspective of Contextualized Evaluation for Explainable AI [65.44737844681256]
A lack of consensus on how to evaluate explainable AI (XAI) hinders the advancement of the field.
We argue that one way to close the gap is to develop evaluation methods that account for different user requirements.
arXiv Detail & Related papers (2022-06-22T05:17:33Z) - Evaluation of Text Generation: A Survey [107.62760642328455]
The paper surveys evaluation methods for natural language generation (NLG) systems that have been developed in recent years.
We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics.
arXiv Detail & Related papers (2020-06-26T04:52:48Z)