Towards a multi-stakeholder value-based assessment framework for
algorithmic systems
- URL: http://arxiv.org/abs/2205.04525v1
- Date: Mon, 9 May 2022 19:28:32 GMT
- Title: Towards a multi-stakeholder value-based assessment framework for
algorithmic systems
- Authors: Mireia Yurrita, Dave Murray-Rust, Agathe Balayn, Alessandro Bozzon
- Abstract summary: We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
- Score: 76.79703106646967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In an effort to regulate Machine Learning (ML)-driven systems, current
auditing processes mostly focus on detecting harmful algorithmic biases. While
these strategies have proven to be impactful, some values outlined in documents
dealing with ethics in ML-driven systems are still underrepresented in auditing
processes. Such unaddressed values mainly deal with contextual factors that
cannot be easily quantified. In this paper, we develop a value-based assessment
framework that is not limited to bias auditing and that covers prominent
ethical principles for algorithmic systems. Our framework presents a circular
arrangement of values with two bipolar dimensions that make common motivations
and potential tensions explicit. In order to operationalize these high-level
principles, values are then broken down into specific criteria and their
manifestations. However, some of these value-specific criteria are mutually
exclusive and require negotiation. As opposed to some other auditing frameworks
that merely rely on ML researchers' and practitioners' input, we argue that it
is necessary to include stakeholders that present diverse standpoints to
systematically negotiate and consolidate value and criteria tensions. To that
end, we map stakeholders with different insight needs, and assign tailored
means for communicating value manifestations to them. We therefore contribute
to current ML auditing practices with an assessment framework that visualizes
closeness and tensions between values and we give guidelines on how to
operationalize them, while opening up the evaluation and deliberation process
to a wide range of stakeholders.
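The "circular arrangement of values with two bipolar dimensions" described in the abstract can be loosely illustrated with a minimal sketch. The value names, their angular placements, and the axis interpretations below are illustrative assumptions, not the paper's actual layout: values placed close together on the circle share motivations, while values near opposite poles are in tension.

```python
import math

# Hypothetical values placed on a circle defined by two bipolar
# dimensions (e.g. individual vs. societal concerns on the x-axis,
# technical vs. contextual concerns on the y-axis). The angles are
# invented for illustration only.
VALUE_ANGLES = {
    "transparency": 0.0,
    "accountability": 45.0,
    "fairness": 90.0,
    "privacy": 180.0,
    "autonomy": 225.0,
    "efficiency": 270.0,
}

def position(value):
    """(x, y) coordinates of a value on the unit circle."""
    a = math.radians(VALUE_ANGLES[value])
    return (math.cos(a), math.sin(a))

def closeness(v1, v2):
    """Cosine of the angular distance between two values:
    +1 = shared motivation, 0 = orthogonal, -1 = maximal tension."""
    a1 = math.radians(VALUE_ANGLES[v1])
    a2 = math.radians(VALUE_ANGLES[v2])
    return math.cos(a1 - a2)

def tensions(threshold=-0.5):
    """Pairs of values whose closeness falls below the threshold,
    i.e. candidates for explicit stakeholder negotiation."""
    names = list(VALUE_ANGLES)
    return [(a, b) for i, a in enumerate(names)
            for b in names[i + 1:] if closeness(a, b) < threshold]
```

Under this toy geometry, diametrically opposed values such as `transparency` and `privacy` surface as tensions to be negotiated, which mirrors the framework's goal of making such conflicts explicit rather than hiding them behind a single audit score.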
Related papers
- Top-K Pairwise Ranking: Bridging the Gap Among Ranking-Based Measures for Multi-Label Classification [120.37051160567277]
This paper proposes a novel measure named Top-K Pairwise Ranking (TKPR).
A series of analyses show that TKPR is compatible with existing ranking-based measures.
On the other hand, we establish a sharp generalization bound for the proposed framework based on a novel technique named data-dependent contraction.
arXiv Detail & Related papers (2024-07-09T09:36:37Z) - Pragmatic auditing: a pilot-driven approach for auditing Machine Learning systems [5.26895401335509]
We present a respective procedure that extends the AI-HLEG guidelines published by the European Commission.
Our audit procedure is based on an ML lifecycle model that explicitly focuses on documentation, accountability, and quality assurance.
We describe two pilots conducted on real-world use cases from two different organisations.
arXiv Detail & Related papers (2024-05-21T20:40:37Z) - Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergence of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv Detail & Related papers (2023-04-13T13:08:38Z) - Fairness in Contextual Resource Allocation Systems: Metrics and
Incompatibility Results [7.705334602362225]
We study systems that allocate scarce resources to satisfy basic needs, such as homeless services that provide housing.
These systems often support communities disproportionately affected by systemic racial, gender, or other injustices.
We propose a framework for evaluating fairness in contextual resource allocation systems inspired by fairness metrics in machine learning.
arXiv Detail & Related papers (2022-12-04T02:30:58Z) - How to Evaluate Explainability? -- A Case for Three Criteria [0.0]
We will provide a multidisciplinary motivation for three quality criteria concerning the information that systems should provide.
Our aim is to fuel the discussion regarding these criteria so that adequate evaluation methods for them can be conceived.
arXiv Detail & Related papers (2022-09-01T11:22:50Z) - A Framework for Auditing Multilevel Models using Explainability Methods [2.578242050187029]
An audit framework for the technical assessment of regression models is proposed.
The focus is on three aspects: the model, discrimination, and transparency/explainability.
It is demonstrated that popular explainability methods, such as SHAP and LIME, underperform in accuracy when interpreting these models.
arXiv Detail & Related papers (2022-07-04T17:53:21Z) - Evaluation Gaps in Machine Learning Practice [13.963766987258161]
In practice, evaluations of machine learning models frequently focus on a narrow range of decontextualized predictive behaviours.
We examine the evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations.
By studying these properties, we demonstrate the machine learning discipline's implicit assumption of a range of commitments which have normative impacts.
arXiv Detail & Related papers (2022-05-11T04:00:44Z) - Towards Quantifiable Dialogue Coherence Evaluation [126.55560816209756]
Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
arXiv Detail & Related papers (2021-06-01T14:11:17Z) - Uncertainty-aware Score Distribution Learning for Action Quality
Assessment [91.05846506274881]
We propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA)
Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores.
When fine-grained score labels are available, we devise a multi-path uncertainty-aware score distribution learning (MUSDL) method to explore the disentangled components of a score.
arXiv Detail & Related papers (2020-06-13T15:41:29Z)
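The score-distribution idea in the last entry can be illustrated with a minimal sketch. The score bins and probabilities below are invented for illustration; a real USDL-style model would predict the distribution from input features. The point is that the model outputs a probability over discrete scores rather than a single regressed value, and the quality estimate is the distribution's expectation.

```python
# Minimal sketch of quality assessment from a predicted score
# distribution. Bins and probabilities are illustrative assumptions.
SCORE_BINS = [0, 1, 2, 3, 4, 5]  # hypothetical judge scores

def expected_score(probs, bins=SCORE_BINS):
    """Expectation of the predicted score distribution."""
    assert abs(sum(probs) - 1.0) < 1e-6, "probabilities must sum to 1"
    return sum(p * b for p, b in zip(probs, bins))

# A confident prediction vs. a more uncertain one with the same mode:
confident = [0.0, 0.0, 0.05, 0.9, 0.05, 0.0]
uncertain = [0.05, 0.1, 0.2, 0.3, 0.2, 0.15]
```

The spread of the distribution (not just its expectation) carries the uncertainty information that a single-score regression discards.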
This list is automatically generated from the titles and abstracts of the papers in this site.