Fugu-MT paper translation (abstract): Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Paper overview: Fair and Calibrated Toxicity Detection with Robust Training and Abstention

arxiv url: http://arxiv.org/abs/2605.14074v1
Date: Wed, 13 May 2026 19:50:35 GMT
Status: translation complete
Last updated in system: 2026-05-15 21:45:34.488221
Title: Fair and Calibrated Toxicity Detection with Robust Training and Abstention
Authors: Mokshit Surana
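Abstract summary: Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently. The paper compares Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across three axes: ranking, calibration, and abstention.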
Reference score (independently computed attention measure): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n = 1000$). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration ($0.013$) but is significantly miscalibrated across all identity subgroups ($+0.029$ to $+0.134$). (2) Training interventions reshape rather than eliminate disparity. Reweighted ERM improves ranking (BPSN AUC $+0.06$ to $+0.12$) but worsens the calibration-fairness gap by up to $+0.232$. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE $0.118$). (3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform. Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair. Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods that differ only in aggregate ranking can differ sharply in failure modes that determine real-world harm.
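The calibration and abstention measurements named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the equal-width binning on the predicted probability of the toxic class, the reading of a subgroup calibration gap as subgroup ECE minus aggregate ECE, the single group label per example, and the 0.5 decision threshold.

```python
from collections import defaultdict


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE on P(toxic): weighted mean of |mean prob - positive rate|.

    Assumption: equal-width bins over [0, 1]; other binning schemes exist.
    """
    if not probs:
        raise ValueError("need at least one prediction")
    bins = defaultdict(list)
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for items in bins.values():
        mean_p = sum(p for p, _ in items) / len(items)
        pos_rate = sum(y for _, y in items) / len(items)
        ece += (len(items) / len(probs)) * abs(mean_p - pos_rate)
    return ece


def subgroup_ece_gaps(probs, labels, groups, n_bins=10):
    """Return (aggregate ECE, {group: subgroup ECE - aggregate ECE}).

    Assumption: each example carries exactly one (hypothetical) group label.
    """
    overall = expected_calibration_error(probs, labels, n_bins)
    gaps = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        sub = expected_calibration_error(
            [probs[i] for i in idx], [labels[i] for i in idx], n_bins
        )
        gaps[g] = sub - overall
    return overall, gaps


def selective_risk(probs, labels, coverage):
    """Error rate on the `coverage` fraction of most confident predictions.

    Confidence is max(p, 1 - p); the rest of the examples are deferred.
    """
    conf = [max(p, 1.0 - p) for p in probs]
    order = sorted(range(len(probs)), key=lambda i: -conf[i])
    keep = order[: max(1, int(round(coverage * len(probs))))]
    errors = sum((probs[i] >= 0.5) != bool(labels[i]) for i in keep)
    return errors / len(keep)


# Toy data: group "a" is calibrated, group "b" is overconfident.
probs = [0.5, 0.5, 1.0, 1.0]
labels = [1, 0, 1, 0]
groups = ["a", "a", "b", "b"]
overall, gaps = subgroup_ece_gaps(probs, labels, groups)
```

On this toy data the aggregate ECE sits between the two subgroup values, so one subgroup gap is negative and the other positive. Evaluating `selective_risk` over a grid of coverage values traces a risk-coverage curve. Under working abstention the risk should fall as coverage shrinks, which is the behavior the abstract reports breaking under Group DRO.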

Related papers