Fugu-MT Paper Translation (Abstract): Compared to What? Baselines and Metrics for Counterfactual Prompting

Paper overview: Compared to What? Baselines and Metrics for Counterfactual Prompting

arxiv url: http://arxiv.org/abs/2605.01048v1
Date: Fri, 01 May 2026 19:23:33 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.556778
Title: Compared to What? Baselines and Metrics for Counterfactual Prompting
Title (reference translation): Compared to What? Baselines and Metrics for Counterfactual Prompting
Authors: Zihao Yang, Mosh Levy, Yoav Goldberg, Byron C. Wallace
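Abstract summary: Surgically changing patient gender yields a prediction flip rate of 14.9% on MedQA. The paper proposes a framework that compares the differences observed under targeted interventions with those induced by paraphrasing the inputs. Once general model sensitivity is accounted for, these effects largely dissipate.
Reference score (independently computed attention score): 39.56472929066589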
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Counterfactual prompting (i.e., perturbing a single factor and measuring output change) is widely used to evaluate properties such as LLM bias and CoT faithfulness. But in this work we argue that observed effects cannot be attributed to the targeted factor without accounting for baseline "meaning-preserving" modifications to text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation; this violates treatment variation irrelevance. We observe prediction flip rates on MedQA of 14.9% when we surgically change patient gender. However, this is statistically indistinguishable from the flip rates induced by simply paraphrasing inputs (14.1%). In this case, it would therefore be unwarranted to conclude that the LLM is especially sensitive to patient gender. To account for this and robustly measure the effects of targeted interventions, we propose a framework in which we compare (via statistical testing) differences observed under target interventions to those induced by paraphrasing inputs. We then use this framework to revisit an analysis done on the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. We find that these effects largely dissipate when we account for general model sensitivity, with only 5 of 120 tests reaching statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional gender bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics (aggregate, per-sample distributional, and regression) and find that per-sample metrics are dramatically more powerful than aggregate metrics, and that regression powerfully and uniquely characterizes effect direction and magnitude.
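Abstract (reference translation): Counterfactual prompting (i.e., perturbing a single factor and measuring the change in output) is widely used to evaluate LLM bias, CoT faithfulness, and similar properties. In this work, however, we argue that observed effects cannot be attributed to the targeted factor unless they are baselined against "meaning-preserving" modifications to the text that establish general model sensitivity. This is because every counterfactual edit is a compound treatment that bundles the variable of interest with incidental surface-form variation. Surgically changing patient gender yields a prediction flip rate of 14.9% on MedQA. However, this is statistically indistinguishable from the flip rate induced by simply paraphrasing the inputs (14.1%). In this case, it is unwarranted to conclude that the LLM is especially sensitive to patient gender. To measure the effects of targeted interventions quantitatively, we therefore propose a framework that compares the differences observed under targeted interventions with those induced by paraphrased inputs. We then use this framework to revisit an analysis performed on the MedPerturb dataset, which reported evidence of model sensitivity to patient demographics and stylistic cues. When general model sensitivity is accounted for, only 5 of 120 tests reach statistical significance. Applying the same framework to occupational biography classification, we detect clearly significant directional bias, showing that the framework identifies real directional effects even when they are small. We evaluate a range of metrics (aggregate, per-sample distributional, and regression) and find that per-sample metrics are dramatically more powerful than aggregate metrics, and that regression powerfully and uniquely characterizes effect direction and magnitude.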
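
The comparison described in the abstract (flip rate under a targeted edit versus flip rate under paraphrase) could be sketched roughly as below. This is an illustrative sketch, not the authors' implementation. The function names, the choice of a two-proportion z-test for the aggregate view, the choice of an exact McNemar test for the per-sample view, and the sample size of 1000 are all assumptions; the abstract does not say which tests or sample sizes were used.

```python
import math


def flip_rate(original, perturbed):
    """Fraction of items whose predicted label changes after a perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("prediction lists must have the same length")
    return sum(o != p for o, p in zip(original, perturbed)) / len(original)


def two_proportion_z_test(k1, n1, k2, n2):
    """Aggregate view: two-sided z-test for a difference between two flip rates.

    Treats the two groups as independent samples, which ignores the pairing
    between the targeted and paraphrased versions of the same item.
    """
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0
    z = (k1 / n1 - k2 / n2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def mcnemar_exact(flips_target, flips_paraphrase):
    """Per-sample view: exact two-sided McNemar test on paired flip indicators."""
    pairs = list(zip(flips_target, flips_paraphrase))
    only_target = sum(t and not q for t, q in pairs)      # flipped only under the targeted edit
    only_paraphrase = sum(q and not t for t, q in pairs)  # flipped only under the paraphrase
    n = only_target + only_paraphrase
    if n == 0:
        return 1.0
    k = min(only_target, only_paraphrase)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Flip rates from the abstract (14.9% vs 14.1%); n = 1000 is a made-up
    # sample size chosen only for illustration.
    z, p = two_proportion_z_test(149, 1000, 141, 1000)
    print(f"z = {z:.3f}, p = {p:.3f}")
```

The paired test uses only the items where the two conditions disagree, which is one simple way to exploit per-sample structure; the aggregate test discards that pairing. Whether this matches the paper's per-sample distributional metrics is not something the abstract confirms.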

Related papers