Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems
- URL: http://arxiv.org/abs/2506.04482v1
- Date: Wed, 04 Jun 2025 22:01:31 GMT
- Title: Understanding and Meeting Practitioner Needs When Measuring Representational Harms Caused by LLM-Based Systems
- Authors: Emma Harvey, Emily Sheng, Su Lin Blodgett, Alexandra Chouldechova, Jean Garcia-Gathright, Alexandra Olteanu, Hanna Wallach
- Abstract summary: We find that practitioners are often unable to use publicly available instruments for measuring representational harms. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure. In other cases, instruments are not used by practitioners due to practical and institutional barriers impeding their uptake.
- Score: 88.35461485731162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The NLP research community has made publicly available numerous instruments for measuring representational harms caused by large language model (LLM)-based systems. These instruments have taken the form of datasets, metrics, tools, and more. In this paper, we examine the extent to which such instruments meet the needs of practitioners tasked with evaluating LLM-based systems. Via semi-structured interviews with 12 such practitioners, we find that practitioners are often unable to use publicly available instruments for measuring representational harms. We identify two types of challenges. In some cases, instruments are not useful because they do not meaningfully measure what practitioners seek to measure or are otherwise misaligned with practitioner needs. In other cases, instruments - even useful instruments - are not used by practitioners due to practical and institutional barriers impeding their uptake. Drawing on measurement theory and pragmatic measurement, we provide recommendations for addressing these challenges to better meet practitioner needs.
Related papers
- ACEBench: Who Wins the Match Point in Tool Usage? [68.54159348899891]
ACEBench is a comprehensive benchmark for assessing tool usage in Large Language Models (LLMs). It categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. It provides a more granular examination of error causes across different data types.
arXiv Detail & Related papers (2025-01-22T12:59:08Z)
- Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems [88.35461485731162]
We identify four types of challenges that prevent practitioners from effectively using publicly available instruments for measuring representational harms.
Our goal is to advance the development of instruments for measuring representational harms that are well-suited to practitioner needs.
arXiv Detail & Related papers (2024-11-23T22:13:38Z)
- Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z)
- Learning to Ask: When LLM Agents Meet Unclear Instruction [55.65312637965779]
Large language models (LLMs) can leverage external tools for addressing a range of tasks unattainable through language skills alone. We evaluate the performance of LLM tool use under imperfect instructions, analyze the error patterns, and build a challenging tool-use benchmark called Noisy ToolBench. We propose a novel framework, Ask-when-Needed (AwN), which prompts LLMs to ask questions to users whenever they encounter obstacles due to unclear instructions.
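The clarification loop this abstract describes can be sketched in a few lines. Everything below is illustrative: `query_llm` is a toy stand-in for a real model call, and the `CLARIFY:` marker protocol is an assumption, not the framework's actual interface.

```python
# Hedged sketch of an Ask-when-Needed (AwN)-style loop: when the model
# signals that an instruction is ambiguous, ask the user, fold the answer
# back into the instruction, and retry.

AMBIGUOUS_MARKER = "CLARIFY:"

def query_llm(instruction: str) -> str:
    """Toy stand-in for an LLM call: flags instructions that mention a
    file without naming one. A real system would use an actual model."""
    if "file" in instruction and ".txt" not in instruction:
        return AMBIGUOUS_MARKER + " Which file should I use?"
    return "DONE: executed '" + instruction + "'"

def ask_when_needed(instruction: str, get_user_answer) -> str:
    """Loop until the model stops asking clarifying questions."""
    response = query_llm(instruction)
    while response.startswith(AMBIGUOUS_MARKER):
        question = response[len(AMBIGUOUS_MARKER):].strip()
        instruction = instruction + " " + get_user_answer(question)
        response = query_llm(instruction)
    return response

result = ask_when_needed("delete the file", lambda question: "notes.txt")
print(result)  # the ambiguous instruction is resolved after one question
```

The key design choice is that ambiguity is surfaced as a question to the user rather than silently guessed at, which is the failure mode the Noisy ToolBench benchmark is built to expose.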
arXiv Detail & Related papers (2024-08-31T23:06:12Z)
- Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead. We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
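The core idea, fitting a small probe on (feature, was-correct) pairs so its output serves as an uncertainty estimate, can be sketched without any model-specific machinery. The one-dimensional logistic probe and the synthetic "answer score" feature below are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: fit a logistic probe on labeled correct/incorrect
# examples, then read its output as p(correct | feature).
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_probe(xs, ys, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression on one feature."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # prediction minus label
            gw += err * x / n
            gb += err / n
        w -= lr * gw
        b -= lr * gb
    return w, b

random.seed(0)
# Synthetic feature: answers with higher scores are more often correct.
xs = [random.gauss(1.0, 1.0) for _ in range(200)] + \
     [random.gauss(-1.0, 1.0) for _ in range(200)]
ys = [1] * 200 + [0] * 200

w, b = fit_probe(xs, ys)
confident = sigmoid(w * 2.0 + b)   # high-score answer -> high p(correct)
uncertain = sigmoid(w * -2.0 + b)  # low-score answer -> low p(correct)
print(round(confident, 2), round(uncertain, 2))
```

Because the probe is trained only on correctness labels, it needs far less compute than full fine-tuning, which is the "small computational overhead" the abstract refers to.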
arXiv Detail & Related papers (2024-06-12T16:41:31Z)
- Truthful Meta-Explanations for Local Interpretability of Machine Learning Models [10.342433824178825]
We present a local meta-explanation technique which builds on top of the truthfulness metric, which is a faithfulness-based metric.
We demonstrate the effectiveness of both the technique and the metric by concretely defining all the concepts and through experimentation.
arXiv Detail & Related papers (2022-12-07T08:32:04Z)
- Undesirable Biases in NLP: Addressing Challenges of Measurement [1.7126708168238125]
We provide an interdisciplinary approach to discussing the issue of NLP model bias by adopting the lens of psychometrics.
We explore two central notions from psychometrics: the construct validity and the reliability of measurement tools.
Our goal is to provide NLP practitioners with methodological tools for designing better bias measures.
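One standard reliability check from psychometrics is internal consistency, commonly summarized with Cronbach's alpha: do the items of a multi-item bias measure agree with one another? The sketch below computes it from first principles; the three "bias-probe items" and their scores are made-up illustration data, not from any real measure.

```python
# Cronbach's alpha: alpha = (k / (k - 1)) * (1 - sum(item variances) / var(totals)),
# where k is the number of items and totals are per-respondent sums.

def variance(values):
    """Sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def cronbach_alpha(items):
    """items: list of k item-score lists, one score per respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_var / variance(totals))

# Three hypothetical bias-probe items scored on five model outputs.
items = [
    [2, 4, 3, 5, 1],
    [2, 5, 3, 4, 1],
    [3, 4, 3, 5, 2],
]
print(round(cronbach_alpha(items), 3))  # close to 1.0: items agree
```

A bias measure whose items disagree (low alpha) is unreliable regardless of whether it is valid, which is why psychometrics treats reliability as a prerequisite for interpreting scores at all.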
arXiv Detail & Related papers (2022-11-24T16:53:18Z)
- Exploring How Machine Learning Practitioners (Try To) Use Fairness Toolkits [35.7895677378462]
We investigate how industry practitioners (try to) work with existing fairness toolkits.
We identify several opportunities for fairness toolkits to better address practitioner needs.
We highlight implications for the design of future open-source fairness toolkits.
arXiv Detail & Related papers (2022-05-13T23:07:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.