Fugu-MT Paper Translation (Abstract): Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

Paper abstract: Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

arxiv url: http://arxiv.org/abs/2603.18015v1
Date: Tue, 24 Feb 2026 16:27:06 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:42.388249
Title: Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection
Authors: Trishita Dhara, Siddhesh Sheth
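Abstract summary: This work presents an explainability-driven analysis of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone.
Reference score (independently computed attention measure): 0.0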
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Although automated harmful content detection systems are widely used to monitor online platforms, moderators and end users often cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little attention has been paid to understanding why neural models identify content as harmful, especially in borderline, contextual, and politically sensitive cases. This work presents an explainability-driven analysis of a neural harmful content detection model trained on the Civil Comments dataset. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients appears to produce more diffuse contextual attributions, while Shapley Additive Explanations produces attributions focused more tightly on explicit lexical cues. This divergence between the two methods' outputs appears in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, and political discourse. The results suggest that explainable AI can support human-in-the-loop moderation by exposing model uncertainty and making the rationale behind automated decisions more interpretable. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.
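The abstract compares two attribution methods, Integrated Gradients and Shapley Additive Explanations. As a rough illustration of what each one computes, the sketch below applies both to a hypothetical three-feature toy scorer. It is not the paper's RoBERTa classifier or its pipeline: the weights, bias, interaction term, and all-zeros baseline are invented for the example, and the Shapley values are computed by exact subset enumeration rather than with the approximations that practical SHAP tooling relies on.

```python
import math
from itertools import combinations

# Hypothetical toy "harmfulness" scorer over three token-presence features.
# The weights, bias, and interaction term are invented for illustration only.
WEIGHTS = [2.0, 0.5, -1.0]
BIAS = -1.5
INTERACTION = 1.5  # extra evidence when features 0 and 1 co-occur


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x):
    """Return the toy probability that input x is harmful."""
    z = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    z += INTERACTION * x[0] * x[1]
    return sigmoid(z)


def gradient(x):
    """Analytic gradient of model(x) with respect to each feature."""
    p = model(x)
    dz = list(WEIGHTS)
    dz[0] += INTERACTION * x[1]
    dz[1] += INTERACTION * x[0]
    return [p * (1.0 - p) * d for d in dz]


def integrated_gradients(x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    totals = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = gradient(point)
        for i in range(n):
            totals[i] += g[i]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


def shapley_values(x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(subset):
        masked = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(masked)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for combo in combinations(others, size):
                s = set(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)
shap = shapley_values(x, baseline)
delta = model(x) - model(baseline)
print("IG:  ", [round(v, 4) for v in ig])
print("SHAP:", [round(v, 4) for v in shap])
print("f(x) - f(baseline):", round(delta, 4))
```

Both methods distribute the same total, the difference between the model output at the input and at the baseline, across the features, but they are defined differently (a path integral of gradients versus an average of marginal contributions over coalitions), so their per-feature values need not agree on a nonlinear model.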
