Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems
- URL: http://arxiv.org/abs/2411.15662v1
- Date: Sat, 23 Nov 2024 22:13:38 GMT
- Title: Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems
- Authors: Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach
- Abstract summary: We identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms.
Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs.
- Abstract: To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has produced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether and to what extent these instruments meet the needs of practitioners tasked with developing and deploying LLM-based systems in the real world, and how these instruments could be improved. Via a series of semi-structured interviews with practitioners in a variety of roles in different organizations, we identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms caused by LLM-based systems: (1) challenges related to using publicly available measurement instruments; (2) challenges related to doing measurement in practice; (3) challenges arising from measurement tasks involving LLM-based systems; and (4) challenges specific to measuring representational harms. Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs, thus better facilitating the responsible development and deployment of LLM-based systems.
Related papers
- Evaluating Generative AI Systems is a Social Science Measurement Challenge [78.35388859345056]
We present a framework for measuring concepts related to the capabilities, impacts, opportunities, and risks of GenAI systems.
The framework distinguishes between four levels: the background concept, the systematized concept, the measurement instrument(s), and the instance-level measurements themselves.
arXiv Detail & Related papers (2024-11-17T02:35:30Z)
- Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z)
- Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z)
- A Framework for Automated Measurement of Responsible AI Harms in Generative AI Applications [15.087045120842207]
We present a framework for the automated measurement of responsible AI (RAI) metrics for large language models (LLMs).
Our framework for automatically measuring harms from LLMs builds on existing technical and sociotechnical expertise.
We use this framework to run through several case studies investigating how different LLMs may violate a range of RAI-related principles.
arXiv Detail & Related papers (2023-10-26T19:45:06Z)
- Identifying Concerns When Specifying Machine Learning-Enabled Systems: A Perspective-Based Approach [1.2184324428571227]
PerSpecML is a perspective-based approach for specifying ML-enabled systems.
It helps practitioners identify which attributes of ML and non-ML components are important contributors to the overall system's quality.
arXiv Detail & Related papers (2023-09-14T18:31:16Z)
- Auditing large language models: a three-layered approach [0.0]
Large language models (LLMs) represent a major advance in artificial intelligence (AI) research.
LLMs also raise significant ethical and social challenges.
Previous research has pointed towards auditing as a promising governance mechanism.
arXiv Detail & Related papers (2023-02-16T18:55:21Z)
- Measuring Data [79.89948814583805]
We identify the task of measuring data to quantitatively characterize the composition of machine learning data and datasets.
Data measurements quantify different attributes of data along common dimensions that support comparison.
We conclude with a discussion of the many avenues of future work, the limitations of data measurements, and how to leverage these measurement approaches in research and practice.
arXiv Detail & Related papers (2022-12-09T22:10:46Z)
- Truthful Meta-Explanations for Local Interpretability of Machine Learning Models [10.342433824178825]
We present a local meta-explanation technique that builds on the truthfulness metric, a faithfulness-based metric.
We demonstrate the effectiveness of both the technique and the metric by concretely defining all the concepts and through experimentation.
arXiv Detail & Related papers (2022-12-07T08:32:04Z)
- Understanding the Usability Challenges of Machine Learning in High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains.
In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions.
We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.