Fugu-MT Paper Translation (Abstract): The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

Paper overview: The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

arxiv url: http://arxiv.org/abs/2606.18656v1
Date: Wed, 17 Jun 2026 03:53:42 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:50.99224
Title: The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs
Authors: Naihao Deng, Yiming Feng, Chimaobi Okite, Kaijian Zou, Lu Wang, Rada Mihalcea, Yulong Chen
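Abstract summary: We show that large language models (LLMs) may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart.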
Reference score (independently calculated attention score): 36.56059375552239
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Warning: This paper studies stereotypes and biases, and contains potentially disturbing examples, used for illustration purposes only. Our findings should not be interpreted as an argument against alignment. Instead, this paper highlights the need for principled approaches to more advanced alignment. Alignment aims to ensure that large language models (LLMs) behave safely and reliably, including by avoiding unsafe inferences. However, we show that such safety-oriented behaviors can misfire: models may reject warranted conclusions even when they are explicitly supported by context. We call this failure mode misfired alignment, where alignment-induced changes cause LLMs to override explicit evidence. To quantify this phenomenon, specifically on stereotype-related alignment, we introduce VETO, a benchmark consisting of 2,032 BBQ-derived contrastive pairs, and define a new metric, Misfired Alignment Rate (MAR), which measures on a 0 to 100 scale how often a model fails on a stereotype-related question but succeeds on its contrastive counterpart. We benchmark 25 LLMs on VETO, and show that all LLMs, including the most recent ones, exhibit non-trivial (4.7 to 18.9%) MARs while all human participants achieve 0.0% MAR. Controlled priming experiments further show that alignment-induced cues can substantially amplify MAR across LLMs, indicating that these failures are not merely artifacts of individual examples but can be induced by safety-related framing. Mechanistic analyses on open-weight LLMs reveal late-layer suppression of evidence-supported answers, and comparisons between instruct and base LLMs suggest that this suppression emerges after instruction training. These findings show that current alignment methods can overgeneralize surface-level safety cues, to the point of overriding objective evidence, motivating more work on alignment objectives that better preserve contextual grounding.
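As a rough illustration of the MAR definition given in the abstract, the following Python sketch counts the contrastive pairs in which a model fails the stereotype-related question but answers its counterpart correctly, and reports that count as a percentage. The abstract does not state the exact formula, so the denominator (all pairs) is an assumption, and the names `PairResult` and `misfired_alignment_rate` are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PairResult:
    """Outcome of one contrastive pair (hypothetical record layout)."""
    target_correct: bool       # stereotype-related question answered correctly
    counterpart_correct: bool  # contrastive counterpart answered correctly


def misfired_alignment_rate(results: Iterable[PairResult]) -> float:
    """Return the share of misfired pairs on a 0 to 100 scale.

    A pair counts as misfired when the model fails the stereotype-related
    question but succeeds on the counterpart. Dividing by the total number
    of pairs is an assumption; the abstract does not specify the denominator.
    """
    results = list(results)
    if not results:
        raise ValueError("need at least one pair")
    misfires = sum(
        1 for r in results if not r.target_correct and r.counterpart_correct
    )
    return 100.0 * misfires / len(results)


# Toy data: one misfire out of four pairs.
pairs = [
    PairResult(target_correct=False, counterpart_correct=True),   # misfire
    PairResult(target_correct=True, counterpart_correct=True),    # both right
    PairResult(target_correct=False, counterpart_correct=False),  # both wrong
    PairResult(target_correct=True, counterpart_correct=True),    # both right
]
print(misfired_alignment_rate(pairs))  # 25.0
```

Under this reading, a pair where the model fails both questions does not count as a misfire, since the failure cannot be attributed to the stereotype-related framing alone.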

Related papers