Fugu-MT paper translation (abstract): Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

Paper overview: Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

arxiv url: http://arxiv.org/abs/2605.06213v1
Date: Thu, 07 May 2026 13:15:31 GMT
Status: translation complete
Last updated in system: 2026-05-08 22:27:11.823084
Title: Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
Title (reference translation): Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
Authors: Haoxiang Wang, Da Yu, Huishuai Zhang
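Abstract summary: This paper proposes Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with difficulty labels validated across 9 reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and adaptively grows the evaluation set when the target falls outside the bank's coverage.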
Reference score (independently computed attention score): 20.61766907174782
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near $0.5$ under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across $9$ reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.
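Abstract (reference translation): Evaluation of large language models (LLMs) currently relies on fixed benchmarks that apply the same item set to every model, which produces ceiling and floor effects that mask capability gaps. The authors argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is close to 0.5 under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across 9 reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and adaptively grows the evaluation set when the target falls outside the bank's coverage. DBE is instantiated on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.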
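
Illustrative note: the abstract defines the boundary as the set of prompts whose pass probability is near 0.5 under random-sampling decoding. The sketch below shows one simple way such a quantity could be estimated and thresholded. It is not the paper's SGBS algorithm or its calibration procedure; the function names, the `tolerance` value, the sample counts, and the simulated model are all assumptions made for illustration.

```python
import random
from typing import Callable, Dict, List


def estimate_pass_probability(
    passes_once: Callable[[str], bool], prompt: str, n_samples: int = 20
) -> float:
    """Fraction of n_samples independently sampled responses that pass."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    passed = sum(1 for _ in range(n_samples) if passes_once(prompt))
    return passed / n_samples


def select_boundary_items(
    pass_probs: Dict[str, float], tolerance: float = 0.15
) -> List[str]:
    """Prompts whose estimated pass probability lies within `tolerance`
    of 0.5, ordered from closest to 0.5 to farthest."""
    near = [p for p, prob in pass_probs.items() if abs(prob - 0.5) <= tolerance]
    return sorted(near, key=lambda p: abs(pass_probs[p] - 0.5))


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical stand-in for "query the target model once and grade it".
    true_rates = {"easy": 0.95, "borderline": 0.5, "hard": 0.05}

    def passes_once(prompt: str) -> bool:
        return rng.random() < true_rates[prompt]

    estimates = {
        p: estimate_pass_probability(passes_once, p, n_samples=200)
        for p in true_rates
    }
    print(select_boundary_items(estimates))
```

In this toy setup, items that a model nearly always passes or nearly always fails are filtered out, which mirrors the ceiling and floor effects the abstract attributes to fixed benchmarks. A real pipeline would replace `passes_once` with an API call plus a grader.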

Related papers