Evaluating Generative AI Systems is a Social Science Measurement Challenge
- URL: http://arxiv.org/abs/2411.10939v1
- Date: Sun, 17 Nov 2024 02:35:30 GMT
- Title: Evaluating Generative AI Systems is a Social Science Measurement Challenge
- Authors: Hanna Wallach, Meera Desai, Nicholas Pangakis, A. Feder Cooper, Angelina Wang, Solon Barocas, Alexandra Chouldechova, Chad Atalla, Su Lin Blodgett, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright, Alexandra Olteanu, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, Abigail Z. Jacobs
- Abstract summary: We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
- Score: 78.35388859345056
- Abstract: Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind, we present a framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems. The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves. This four-level approach differs from the way measurement is typically done in ML, where researchers and practitioners appear to jump straight from background concepts to measurement instruments, with little to no explicit systematization in between. As well as surfacing assumptions, thereby making it easier to understand exactly what the resulting measurements do and do not mean, this framework has two important implications for evaluating evaluations: First, it can enable stakeholders from different worlds to participate in conceptual debates, broadening the expertise involved in evaluating GenAI systems. Second, it brings rigor to operational debates by offering a set of lenses for interrogating the validity of measurement instruments and their resulting measurements.
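To make the four levels concrete, here is a minimal sketch (ours, not the paper's) of how they might be represented as an explicit pipeline in Python; all class names, the toy lexicon, and the keyword-based instrument are hypothetical illustrations:

```python
# Illustrative sketch of the four-level measurement framework as a pipeline.
# All names are hypothetical; the instrument is deliberately naive.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BackgroundConcept:
    name: str          # e.g., "toxicity" -- broad, contested, under-specified
    description: str

@dataclass
class SystematizedConcept:
    background: BackgroundConcept
    definition: str    # the explicit, debatable definition adopted for measurement

@dataclass
class MeasurementInstrument:
    concept: SystematizedConcept
    measure: Callable[[str], float]   # maps one system output to a score

def run_evaluation(instrument: MeasurementInstrument, outputs: list[str]) -> list[float]:
    """Produce instance-level measurements: one score per system output."""
    return [instrument.measure(o) for o in outputs]

# Example: a keyword-based instrument for a systematized notion of toxicity.
background = BackgroundConcept("toxicity", "Harmful or offensive generated text")
systematized = SystematizedConcept(background, "Fraction of flagged terms in the output")
FLAGGED = {"badword1", "badword2"}   # placeholder lexicon
instrument = MeasurementInstrument(
    systematized,
    lambda text: sum(w in FLAGGED for w in text.lower().split()) / max(len(text.split()), 1),
)
scores = run_evaluation(instrument, ["a harmless sentence", "another output"])
```

The point of making systematization explicit is that the adopted definition (here, a naive flagged-term fraction) can be debated and revised separately from the instrument that implements it.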
Related papers
- Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems [88.35461485731162]
We identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms.
Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs.
arXiv Detail & Related papers (2024-11-23T22:13:38Z)
- Measuring Human and AI Values based on Generative Psychometrics with Large Language Models [13.795641564238434]
With recent advances in AI, large language models (LLMs) have emerged as both tools for and subjects of value measurement.
This work introduces Generative Psychometrics for Values (GPV), a data-driven value measurement paradigm grounded in text-revealed selective perceptions.
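A hedged sketch of what an LLM-based value-measurement loop might look like; this is not the GPV implementation, and `llm_score_perception`, the value list, and the aggregation are placeholders:

```python
# Hypothetical sketch in the spirit of LLM-based value measurement.
from collections import defaultdict
from statistics import mean

VALUES = ["benevolence", "achievement", "security"]  # example value dimensions

def llm_score_perception(text: str, value: str) -> float:
    """Placeholder for an LLM judgment rating how strongly a text reveals a value, in [0, 1]."""
    return 0.5  # stub

def measure_values(texts: list[str]) -> dict[str, float]:
    """Aggregate per-text, per-value scores into a value profile."""
    scores = defaultdict(list)
    for text in texts:
        for value in VALUES:
            scores[value].append(llm_score_perception(text, value))
    return {value: mean(vs) for value, vs in scores.items()}

profile = measure_values(["I donated my bonus to charity.", "I work late to get promoted."])
```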
arXiv Detail & Related papers (2024-09-18T16:26:22Z)
- Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics [47.762333925222926]
We present a novel metric to quantify biased behaviors of machine learning models.
We focus on and apply it to the operational evaluation of face recognition systems.
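For flavor, here is a generic illustration of operational bias evaluation in biometrics; it is not the CEI formula defined in the paper, just a standard comparison of error rates across demographic groups at a fixed threshold:

```python
# Generic group-wise error-rate comparison for a face recognition system.
import numpy as np

def error_rates(scores: np.ndarray, labels: np.ndarray, threshold: float):
    """FMR and FNMR for similarity scores with genuine(1)/impostor(0) labels."""
    decisions = scores >= threshold
    fmr = np.mean(decisions[labels == 0])      # impostors accepted
    fnmr = np.mean(~decisions[labels == 1])    # genuines rejected
    return fmr, fnmr

def group_disparity(groups: dict, threshold: float = 0.5) -> float:
    """Max-over-min ratio of FNMR across demographic groups; 1.0 means parity."""
    fnmrs = [error_rates(s, l, threshold)[1] for s, l in groups.values()]
    return max(fnmrs) / max(min(fnmrs), 1e-12)
```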
arXiv Detail & Related papers (2024-09-03T14:19:38Z)
- Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale [2.50194939587674]
This dissertation quantifies and mitigates sources of arbitrariness in ML, and of randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability.
The dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately bound up with research in law and policy.
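One common way to operationalize arbitrariness, sketched below under our own assumptions (not the dissertation's exact protocol), is to measure how often models trained on bootstrap resamples with different seeds disagree on individual test examples:

```python
# Sketch: per-example prediction arbitrariness via bootstrap + seed variation.
import numpy as np
from sklearn.linear_model import LogisticRegression

def prediction_disagreement(X_train, y_train, X_test, n_models: int = 20) -> np.ndarray:
    """Fraction of models disagreeing with the majority vote, per test example.
    Assumes binary 0/1 labels."""
    rng = np.random.default_rng(0)
    preds = []
    for seed in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap resample
        model = LogisticRegression(random_state=seed, max_iter=1000)
        model.fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    preds = np.stack(preds)                       # (n_models, n_test)
    majority = (preds.mean(axis=0) >= 0.5).astype(int)
    return (preds != majority).mean(axis=0)       # per-example disagreement rate
```

Examples with high disagreement receive effectively arbitrary labels: which prediction they get depends on the seed and the resample, not on the data.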
arXiv Detail & Related papers (2024-06-13T19:29:37Z)
- Evaluating General-Purpose AI with Psychometrics [43.85432514910491]
We discuss the need for a comprehensive and accurate evaluation of general-purpose AI systems such as large language models.
Current evaluation methodology, mostly based on benchmarks of specific tasks, falls short of adequately assessing these versatile AI systems.
To tackle these challenges, we suggest transitioning from task-oriented evaluation to construct-oriented evaluation.
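Construct-oriented evaluation borrows standard psychometric tools. As a small worked example, Cronbach's alpha estimates whether a set of benchmark items coheres as a measure of a single construct (the matrix layout below is our assumption):

```python
# Cronbach's alpha: internal consistency of items intended to measure one construct.
import numpy as np

def cronbachs_alpha(item_scores: np.ndarray) -> float:
    """item_scores: (n_systems, n_items) matrix of per-item scores."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)
```

Low alpha suggests the items do not hang together, i.e., the benchmark may not be measuring a single construct at all.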
arXiv Detail & Related papers (2023-10-25T05:38:38Z)
- A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems [128.63953314853327]
"Lifelong Learning" systems are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability.
We show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems.
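As an illustration, the sketch below computes backward and forward transfer from an accuracy matrix, in the style of widely used continual-learning metrics; these are not necessarily the specific metrics in the paper's suite:

```python
# Continual-learning transfer metrics from an accuracy matrix R, where
# R[i, j] is accuracy on task j after training on tasks 0..i.
import numpy as np

def backward_transfer(R: np.ndarray) -> float:
    """Mean change in earlier-task accuracy after learning all tasks (negative = forgetting)."""
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))

def forward_transfer(R: np.ndarray, baseline: np.ndarray) -> float:
    """Mean zero-shot gain on task j before training on it, vs. an untrained baseline."""
    T = R.shape[0]
    return float(np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)]))
```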
arXiv Detail & Related papers (2023-01-18T21:58:54Z)
- Measuring Data [79.89948814583805]
We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets.
Data measurements quantify different attributes of data along common dimensions that support comparison.
We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
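Two simple, hypothetical examples of such measurements, label-distribution entropy and text-length statistics, sketched in Python (the paper's taxonomy covers many more dimensions):

```python
# Simple data measurements along comparable dimensions.
import math
from collections import Counter

def label_entropy(labels: list) -> float:
    """Shannon entropy of the label distribution, in bits (higher = more balanced)."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def length_stats(texts: list[str]) -> dict:
    """Token-length and vocabulary statistics for a text dataset."""
    lengths = [len(t.split()) for t in texts]
    return {"mean_tokens": sum(lengths) / len(lengths),
            "max_tokens": max(lengths),
            "vocab_size": len({w for t in texts for w in t.split()})}
```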
arXiv Detail & Related papers (2022-12-09T22:10:46Z)
- Active Inference in Robotics and Artificial Agents: Survey and Challenges [51.29077770446286]
We review the state-of-the-art theory and implementations of active inference for state-estimation, control, planning and learning.
We showcase relevant experiments that illustrate its potential in terms of adaptation, generalization and robustness.
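A minimal one-step sketch of discrete active inference, in textbook style rather than any implementation surveyed in the paper; the likelihood matrix, transition models, and preference prior below are toy assumptions:

```python
# One-step discrete active inference: pick the action minimizing expected
# free energy = risk (KL from preferred outcomes) + ambiguity (expected
# observation entropy).
import numpy as np

A = np.array([[0.9, 0.1],       # P(o | s): observation likelihood
              [0.1, 0.9]])
B = {0: np.array([[1.0, 0.0],   # P(s' | s, a=0): stay
                  [0.0, 1.0]]),
     1: np.array([[0.0, 1.0],   # P(s' | s, a=1): switch
                  [1.0, 0.0]])}
C = np.array([0.9, 0.1])        # preferred-outcome prior P(o)
q_s = np.array([0.8, 0.2])      # current belief over states

def expected_free_energy(action: int) -> float:
    q_s_next = B[action] @ q_s                 # predicted next-state belief
    q_o = A @ q_s_next                         # predicted outcome distribution
    risk = np.sum(q_o * np.log(q_o / C))       # KL(Q(o) || C)
    ambiguity = -np.sum(q_s_next * np.sum(A * np.log(A), axis=0))
    return float(risk + ambiguity)

best_action = min(B, key=expected_free_energy)
```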
arXiv Detail & Related papers (2021-12-03T12:10:26Z)
- Measurement as governance in and for responsible AI [0.0]
Measurement of social phenomena is everywhere, unavoidably, in sociotechnical systems.
We use the language of measurement to uncover hidden governance decisions.
We then explore the constructs of fairness, robustness, and responsibility in the context of governance in and for responsible AI.
arXiv Detail & Related papers (2021-09-13T01:04:22Z)
- An Objective Metric for Explainable AI: How and Why to Estimate the Degree of Explainability [3.04585143845864]
We present a new model-agnostic metric to measure the Degree of eXplainability of correct information in an objective way.
We designed several experiments and a user study on two realistic AI-based systems for healthcare and finance.
arXiv Detail & Related papers (2021-09-11T17:44:13Z)