Fugu-MT Paper Translation (Abstract): Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

Paper overview: Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

arxiv url: http://arxiv.org/abs/2605.14517v1
Date: Thu, 14 May 2026 08:00:23 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.699296
Title: Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Title (reference translation): Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Authors: Gang Peng
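Abstract summary: Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. This paper proposes a dimension-level intent fidelity evaluation framework, applied through a structured prompt ablation study across 2,880 outputs.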
Reference score (independently computed attention measure): 0.585480332059272
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.
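Abstract (reference translation): Holistic evaluation scores capture overall output quality, but they do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. This paper proposes a dimension-level intent fidelity evaluation framework that separately measures structural recovery and intent fidelity for each semantic dimension, applied through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs. The framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits, and among English-language outputs the proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These results indicate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.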
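The split-zone proportion reported in the abstract (perfect holistic score, GA=5, alongside a measurable dimensional intent deficit) can be illustrated with a minimal sketch. This is not the paper's implementation: the `ScoredOutput` record, the dimension names, the 0-to-1 fidelity scale, and the rule that any dimension below `full_fidelity` counts as a deficit are all assumptions made for illustration. Only the GA=5 threshold and the restriction to outputs with complete paired scores come from the abstract.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScoredOutput:
    # Holistic alignment score; the abstract treats GA=5 as perfect.
    ga: Optional[int]
    # Hypothetical per-dimension intent fidelity scores (assumed 0.0 to 1.0).
    dim_fidelity: Dict[str, float]


def has_complete_pair(o: ScoredOutput) -> bool:
    """An output counts only if both the holistic and dimensional scores exist."""
    return o.ga is not None and len(o.dim_fidelity) > 0


def is_split_zone(o: ScoredOutput, perfect_ga: int = 5,
                  full_fidelity: float = 1.0) -> bool:
    """Perfect holistic score, yet at least one dimension falls short (assumed rule)."""
    return o.ga == perfect_ga and any(
        score < full_fidelity for score in o.dim_fidelity.values()
    )


def split_zone_rate(outputs: List[ScoredOutput]) -> float:
    """Share of complete-pair outputs that fall in the split zone."""
    complete = [o for o in outputs if has_complete_pair(o)]
    if not complete:
        return 0.0
    return sum(is_split_zone(o) for o in complete) / len(complete)


# Toy data, invented for illustration only.
sample = [
    ScoredOutput(5, {"tone": 1.0, "audience": 0.5}),  # split zone
    ScoredOutput(5, {"tone": 1.0, "audience": 1.0}),  # perfect on both views
    ScoredOutput(4, {"tone": 0.5}),                   # holistic score already flags it
    ScoredOutput(None, {"tone": 0.2}),                # incomplete pair, excluded
]
rate = split_zone_rate(sample)
```

On the toy data, one of the three complete-pair outputs is in the split zone, which is the kind of case a holistic score alone would not surface.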
