Fugu-MT Paper Translation (Abstract): When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

Paper overview: When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning

arxiv url: http://arxiv.org/abs/2604.15038v1
Date: Thu, 16 Apr 2026 14:07:37 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.93669
Title: When Fairness Metrics Disagree: Evaluating the Reliability of Demographic Fairness Assessment in Machine Learning
Authors: Khalid Adnan Alsayed
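Abstract summary: The paper investigates the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. The results show that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions about model bias. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.
Reference score (independently computed attention score): 0.0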
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evaluation of fairness in machine learning systems has become a central concern in high-stakes applications, including biometric recognition, healthcare decision-making, and automated risk assessment. Existing approaches typically rely on a small number of fairness metrics to assess model behaviour across group partitions, implicitly assuming that these metrics provide consistent and reliable conclusions. However, different fairness metrics capture distinct statistical properties of model performance and may therefore produce conflicting assessments when applied to the same system. In this work, we investigate the consistency of fairness evaluation by conducting a systematic multi-metric analysis of demographic bias in machine learning models. Using face recognition as a controlled experimental setting, we evaluate model performance across multiple group partitions under a range of commonly used fairness metrics, including error-rate disparities and performance-based measures. Our results demonstrate that fairness assessments can vary significantly depending on the choice of metrics, leading to contradictory conclusions regarding model bias. To quantify this phenomenon, we introduce the Fairness Disagreement Index (FDI), a measure designed to capture the degree of inconsistency across fairness metrics. We further show that disagreement remains high across thresholds and model configurations. These findings highlight a critical limitation in current fairness evaluation practices and suggest that single-metric reporting is insufficient for reliable bias assessment.
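To make the abstract's central claim concrete, the following is a minimal sketch of how error-rate metrics can point in opposite directions for the same model. Everything in it is an illustrative assumption: the confusion counts for groups "A" and "B" are invented, and the pairwise `disagreement` score is a hypothetical stand-in, not the paper's Fairness Disagreement Index, whose definition is not given in the abstract.

```python
from itertools import combinations


def rates(tp, fn, fp, tn):
    """Return error rates (higher is worse) from one group's confusion counts."""
    return {
        "fpr": fp / (fp + tn),                     # false positive rate
        "fnr": fn / (fn + tp),                     # false negative rate
        "error": (fp + fn) / (tp + fn + fp + tn),  # overall error rate
    }


# Invented counts for two hypothetical demographic groups.
groups = {
    "A": rates(tp=90, fn=10, fp=5, tn=95),
    "B": rates(tp=96, fn=4, fp=12, tn=88),
}


def worse_off(metric):
    """Name the group with the higher (worse) value of the given metric."""
    return max(groups, key=lambda g: groups[g][metric])


# Which group each metric flags as disadvantaged.
verdicts = {m: worse_off(m) for m in ("fpr", "fnr", "error")}

# Hypothetical score: the share of metric pairs that flag different groups.
pairs = list(combinations(verdicts, 2))
disagreement = sum(verdicts[a] != verdicts[b] for a, b in pairs) / len(pairs)

print(verdicts)
print(round(disagreement, 3))
```

With these counts, the false positive rate and the overall error rate flag group B, while the false negative rate flags group A, so a report based on any single metric would tell only part of the story.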

Related papers