Dimensions of Generative AI Evaluation Design
- URL: http://arxiv.org/abs/2411.12709v1
- Date: Tue, 19 Nov 2024 18:25:30 GMT
- Title: Dimensions of Generative AI Evaluation Design
- Authors: P. Alex Dow, Jennifer Wortman Vaughan, Solon Barocas, Chad Atalla, Alexandra Chouldechova, Hanna Wallach
- Abstract summary: We propose a set of general dimensions that capture critical choices involved in GenAI evaluation design.
These dimensions include the evaluation setting, the task type, the input source, the interaction style, the duration, the metric type, and the scoring method.
- Score: 51.541816010127256
- Abstract: There are few principles or guidelines to ensure evaluations of generative AI (GenAI) models and systems are effective. To help address this gap, we propose a set of general dimensions that capture critical choices involved in GenAI evaluation design. These dimensions include the evaluation setting, the task type, the input source, the interaction style, the duration, the metric type, and the scoring method. By situating GenAI evaluations within these dimensions, we aim to guide decision-making during GenAI evaluation design and provide a structure for comparing different evaluations. We illustrate the utility of the proposed set of general dimensions using two examples: a hypothetical evaluation of the fairness of a GenAI system and three real-world GenAI evaluations of biological threats.
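To make the proposed dimensions concrete, below is a minimal, purely illustrative Python sketch of how a single evaluation could be situated along them. The dimension names come from the abstract, but all class names, field names, and enum values are hypothetical assumptions, not an API or taxonomy defined in the paper.

```python
# Illustrative sketch only: hypothetical names and values, not the paper's API.
from dataclasses import dataclass
from enum import Enum


class Setting(Enum):
    MODEL = "model"        # evaluating a GenAI model in isolation (assumed value)
    SYSTEM = "system"      # evaluating a deployed GenAI system (assumed value)


class TaskType(Enum):
    OPEN_ENDED = "open-ended"
    CONSTRAINED = "constrained"


class InputSource(Enum):
    BENCHMARK = "benchmark"
    USER_GENERATED = "user-generated"


class InteractionStyle(Enum):
    SINGLE_TURN = "single-turn"
    MULTI_TURN = "multi-turn"


class MetricType(Enum):
    HUMAN_JUDGMENT = "human judgment"
    AUTOMATIC = "automatic"


@dataclass
class EvaluationDesign:
    """One evaluation, described along the general dimensions named in the abstract."""
    setting: Setting
    task_type: TaskType
    input_source: InputSource
    interaction_style: InteractionStyle
    duration: str          # e.g. "single session", "two weeks"
    metric_type: MetricType
    scoring_method: str    # e.g. "pass/fail", "5-point Likert scale"


# Hypothetical example: a fairness evaluation of a deployed GenAI system.
fairness_eval = EvaluationDesign(
    setting=Setting.SYSTEM,
    task_type=TaskType.OPEN_ENDED,
    input_source=InputSource.BENCHMARK,
    interaction_style=InteractionStyle.SINGLE_TURN,
    duration="single session",
    metric_type=MetricType.HUMAN_JUDGMENT,
    scoring_method="5-point Likert scale",
)
print(fairness_eval)
```

Recording each evaluation as such a structured record would make it straightforward to compare evaluations dimension by dimension, which is one of the stated aims of the paper.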
Related papers
- Position: Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We argue that the ML community would benefit from learning from and drawing on the social sciences when developing measurement instruments for evaluating GenAI systems.
We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI.
arXiv Detail & Related papers (2025-02-01T21:09:51Z) - A Shared Standard for Valid Measurement of Generative AI Systems' Capabilities, Risks, and Impacts [38.66213773948168]
The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems.
We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing.
arXiv Detail & Related papers (2024-12-02T19:50:00Z) - Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
arXiv Detail & Related papers (2024-11-17T02:35:30Z) - GenLens: A Systematic Evaluation of Visual GenAI Model Outputs [33.93591473459988]
GenLens is a visual analytic interface designed for the systematic evaluation of GenAI model outputs.
A user study with model developers reveals that GenLens effectively enhances their workflow, evidenced by high satisfaction rates.
arXiv Detail & Related papers (2024-02-06T04:41:06Z) - How much informative is your XAI? A decision-making assessment task to
objectively measure the goodness of explanations [53.01494092422942]
The number and complexity of personalised and user-centred approaches to XAI have rapidly grown in recent years.
User-centred approaches to XAI have been shown to positively affect the interaction between users and systems.
We propose an assessment task to objectively and quantitatively measure the goodness of XAI systems.
arXiv Detail & Related papers (2023-12-07T15:49:39Z) - Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z) - A System's Approach Taxonomy for User-Centred XAI: A Survey [0.6882042556551609]
We propose a unified, inclusive and user-centred taxonomy for XAI based on the principles of General System's Theory.
This provides a basis for evaluating the appropriateness of XAI approaches for all user types, including both developers and end users.
arXiv Detail & Related papers (2023-03-06T00:50:23Z) - Connecting Algorithmic Research and Usage Contexts: A Perspective of Contextualized Evaluation for Explainable AI [65.44737844681256]
A lack of consensus on how to evaluate explainable AI (XAI) hinders the advancement of the field.
We argue that one way to close the gap is to develop evaluation methods that account for different user requirements.
arXiv Detail & Related papers (2022-06-22T05:17:33Z) - Evaluation of Text Generation: A Survey [107.62760642328455]
The paper surveys evaluation methods for natural language generation (NLG) systems that have been developed in recent years.
We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics.
arXiv Detail & Related papers (2020-06-26T04:52:48Z)