Evaluating Generative AI Systems is a Social Science Measurement Challenge
- URL: http://arxiv.org/abs/2411.10939v1
- Date: Sun, 17 Nov 2024 02:35:30 GMT
- Title: Evaluating Generative AI Systems is a Social Science Measurement Challenge
- Authors: Hanna Wallach, Meera Desai, Nicholas Pangakis, A. Feder Cooper, Angelina Wang, Solon Barocas, Alexandra Chouldechova, Chad Atalla, Su Lin Blodgett, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
- Abstract summary: We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
- Score: 78.35388859345056
- Abstract: Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
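To make the four levels concrete, here is a minimal sketch (ours, not the paper's) of how they might be represented as an explicit pipeline in Python; all class names, the toy lexicon, and the keyword-based instrument are hypothetical illustrations:

```python
# Illustrative sketch of the four-level measurement framework as a pipeline.
# All names are hypothetical; the instrument is deliberately naive.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BackgroundConcept:
    name: str          # e.g., "toxicity" -- broad, contested, under-specified
    description: str

@dataclass
class SystematizedConcept:
    background: BackgroundConcept
    definition: str    # the explicit, debatable definition adopted for measurement

@dataclass
class MeasurementInstrument:
    concept: SystematizedConcept
    measure: Callable[[str], float]   # maps one system output to a score

def run_evaluation(instrument: MeasurementInstrument, outputs: list[str]) -> list[float]:
    """Produce instance-level measurements: one score per system output."""
    return [instrument.measure(o) for o in outputs]

# Example: a keyword-based instrument for a systematized notion of toxicity.
background = BackgroundConcept("toxicity", "Harmful or offensive generated text")
systematized = SystematizedConcept(background, "Fraction of flagged terms in the output")
FLAGGED = {"badword1", "badword2"}   # placeholder lexicon
instrument = MeasurementInstrument(
    systematized,
    lambda text: sum(w in FLAGGED for w in text.lower().split()) / max(len(text.split()), 1),
)
scores = run_evaluation(instrument, ["a harmless sentence", "another output"])
```

The point of making systematization explicit is that the adopted definition (here, a naive flagged-term fraction) can be debated and revised separately from the instrument that implements it.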
Related papers
- Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems [88.35461485731162]
We identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms.
Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs.
arXiv Detail & Related papers (2024-11-23T22:13:38Z)
- Measuring Human and AI Values based on Generative Psychometrics with Large Language Models [13.795641564238434]
With recent advances in AI, large language models (LLMs) have emerged as both tools for and subjects of value measurement.
This work introduces Generative Psychometrics for Values (GPV), a data-driven value measurement paradigm grounded in text-revealed selective perceptions.
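A hedged sketch of what an LLM-based value-measurement loop might look like; this is not the GPV implementation, and `llm_score_perception`, the value list, and the aggregation are placeholders:

```python
# Hypothetical sketch in the spirit of LLM-based value measurement.
from collections import defaultdict
from statistics import mean

VALUES = ["benevolence", "achievement", "security"]  # example value dimensions

def llm_score_perception(text: str, value: str) -> float:
    """Placeholder for an LLM judgment rating how strongly a text reveals a value, in [0, 1]."""
    return 0.5  # stub

def measure_values(texts: list[str]) -> dict[str, float]:
    """Aggregate per-text, per-value scores into a value profile."""
    scores = defaultdict(list)
    for text in texts:
        for value in VALUES:
            scores[value].append(llm_score_perception(text, value))
    return {value: mean(vs) for value, vs in scores.items()}

profile = measure_values(["I donated my bonus to charity.", "I work late to get promoted."])
```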
arXiv Detail & Related papers (2024-09-18T16:26:22Z)
- Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics [47.762333925222926]
We present a novel metric to quantify biased behaviors of machine learning models.
We focus on and apply it to the operational evaluation of face recognition systems.
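For flavor, here is a generic illustration of operational bias evaluation in biometrics; it is not the CEI formula defined in the paper, just a standard comparison of error rates across demographic groups at a fixed threshold:

```python
# Generic group-wise error-rate comparison for a face recognition system.
import numpy as np

def error_rates(scores: np.ndarray, labels: np.ndarray, threshold: float):
    """FMR and FNMR for similarity scores with genuine(1)/impostor(0) labels."""
    decisions = scores >= threshold
    fmr = np.mean(decisions[labels == 0])      # impostors accepted
    fnmr = np.mean(~decisions[labels == 1])    # genuines rejected
    return fmr, fnmr

def group_disparity(groups: dict, threshold: float = 0.5) -> float:
    """Max-over-min ratio of FNMR across demographic groups; 1.0 means parity."""
    fnmrs = [error_rates(s, l, threshold)[1] for s, l in groups.values()]
    return max(fnmrs) / max(min(fnmrs), 1e-12)
```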
arXiv Detail & Related papers (2024-09-03T14:19:38Z)
- Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale [2.50194939587674]
This dissertation quantifies and mitigates sources of arbitrariness in ML, and of randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability.
The dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately bound up with research in law and policy.
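One common way to operationalize arbitrariness, sketched below under our own assumptions (not the dissertation's exact protocol), is to measure how often models trained on bootstrap resamples with different seeds disagree on individual test examples:

```python
# Sketch: per-example prediction arbitrariness via bootstrap + seed variation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prediction_disagreement(X_train, y_train, X_test, n_models: int = 20) -> np.ndarray:
    """Fraction of models disagreeing with the majority vote, per test example.
    Assumes binary 0/1 labels."""
    rng = np.random.default_rng(0)
    preds = []
    for seed in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap resample
        model = LogisticRegression(random_state=seed, max_iter=1000)
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    preds = np.stack(preds)                       # (n_models, n_test)
    majority = (preds.mean(axis=0) >= 0.5).astype(int)
    return (preds != majority).mean(axis=0)       # per-example disagreement rate
```

Examples with high disagreement receive effectively arbitrary labels: which prediction they get depends on the seed and the resample, not on the data.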
arXiv Detail & Related papers (2024-06-13T19:29:37Z)
- Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
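Construct-oriented evaluation borrows standard psychometric tools. As a small worked example, Cronbach's alpha estimates whether a set of benchmark items coheres as a measure of a single construct (the matrix layout below is our assumption):

```python
# Cronbach's alpha: internal consistency of items intended to measure one construct.
import numpy as np

def cronbachs_alpha(item_scores: np.ndarray) -> float:
    """item_scores: (n_systems, n_items) matrix of per-item scores."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)
```

Low alpha suggests the items do not hang together, i.e., the benchmark may not be measuring a single construct at all.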
arXiv Detail & Related papers (2023-10-25T05:38:38Z)
- A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems [128.63953314853327]
"Lifelong Learning" systems are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability.
We show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems.
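As an illustration, the sketch below computes backward and forward transfer from an accuracy matrix, in the style of widely used continual-learning metrics; these are not necessarily the specific metrics in the paper's suite:

```python
# Continual-learning transfer metrics from an accuracy matrix R, where
# R[i, j] is accuracy on task j after training on tasks 0..i.
import numpy as np

def backward_transfer(R: np.ndarray) -> float:
    """Mean change in earlier-task accuracy after learning all tasks (negative = forgetting)."""
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))

def forward_transfer(R: np.ndarray, baseline: np.ndarray) -> float:
    """Mean zero-shot gain on task j before training on it, vs. an untrained baseline."""
    T = R.shape[0]
    return float(np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)]))
```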
arXiv Detail & Related papers (2023-01-18T21:58:54Z)
- Measuring Data [79.89948814583805]
We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets.
Data measurements quantify different attributes of data along common dimensions that support comparison.
We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
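Two simple, hypothetical examples of such measurements, label-distribution entropy and text-length statistics, sketched in Python (the paper's taxonomy covers many more dimensions):

```python
# Simple data measurements along comparable dimensions.
import math
from collections import Counter

def label_entropy(labels: list) -> float:
    """Shannon entropy of the label distribution, in bits (higher = more balanced)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def length_stats(texts: list[str]) -> dict:
    """Token-length and vocabulary statistics for a text dataset."""
    lengths = [len(t.split()) for t in texts]
    return {"mean_tokens": sum(lengths) / len(lengths),
            "max_tokens": max(lengths),
            "vocab_size": len({w for t in texts for w in t.split()})}
```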
arXiv Detail & Related papers (2022-12-09T22:10:46Z)
- Active Inference in Robotics and Artificial Agents: Survey and Challenges [51.29077770446286]
We review the state-of-the-art theory and implementations of active inference for state-estimation, control, planning and learning.
We showcase relevant experiments that illustrate its potential in terms of adaptation, generalization and robustness.
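A minimal one-step sketch of discrete active inference, in textbook style rather than any implementation surveyed in the paper; the likelihood matrix, transition models, and preference prior below are toy assumptions:

```python
# One-step discrete active inference: pick the action minimizing expected
# free energy = risk (KL from preferred outcomes) + ambiguity (expected
# observation entropy).
import numpy as np

A = np.array([[0.9, 0.1],       # P(o | s): observation likelihood
              [0.1, 0.9]])
B = {0: np.array([[1.0, 0.0],   # P(s' | s, a=0): stay
                  [0.0, 1.0]]),
     1: np.array([[0.0, 1.0],   # P(s' | s, a=1): switch
                  [1.0, 0.0]])}
C = np.array([0.9, 0.1])        # preferred-outcome prior P(o)
q_s = np.array([0.8, 0.2])      # current belief over states

def expected_free_energy(action: int) -> float:
    q_s_next = B[action] @ q_s                 # predicted next-state belief
    q_o = A @ q_s_next                         # predicted outcome distribution
    risk = np.sum(q_o * np.log(q_o / C))       # KL(Q(o) || C)
    ambiguity = -np.sum(q_s_next * np.sum(A * np.log(A), axis=0))
    return float(risk + ambiguity)

best_action = min(B, key=expected_free_energy)
```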
arXiv Detail & Related papers (2021-12-03T12:10:26Z)
- Measurement as governance in and for responsible AI [0.0]
Measurement of social phenomena is everywhere, unavoidably, in sociotechnical systems.
We use the language of measurement to uncover hidden governance decisions.
We then explore the constructs of fairness, robustness, and responsibility in the context of governance in and for responsible AI.
arXiv Detail & Related papers (2021-09-13T01:04:22Z)
- An Objective Metric for Explainable AI: How and Why to Estimate the Degree of Explainability [3.04585143845864]
We present a new model-agnostic metric to measure the Degree of eXplainability of correct information in an objective way.
We designed several experiments and a user study on two realistic AI-based systems for healthcare and finance.
arXiv Detail & Related papers (2021-09-11T17:44:13Z)