Fugu-MT Paper Translation (Abstract): Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution

Paper overview: Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution

arxiv url: http://arxiv.org/abs/2604.14843v1
Date: Thu, 16 Apr 2026 10:25:12 GMT
Status: Translation complete
Last updated in system: 2026-04-17 21:29:31.840725
Title: Exploring and Testing Skill-Based Behavioral Profile Annotation: Human Operability and LLM Feasibility under Schema-Guided Execution
Authors: Yufeng Wu
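Abstract summary: Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. The authors implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Open-source failures are concentrated in schema-to-skill execution problems.
Reference score (attention score computed independently by the site): 2.545461559283292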
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Behavioral Profile (BP) annotation is difficult to automate because it requires simultaneous coding across multiple linguistic dimensions. We treat BP annotation as a bundle of annotation skills rather than a single task and evaluate LLM-assisted BP annotation from this perspective. Using 3,134 concordance lines of 30 Chinese metaphorical color-term derivatives and a 14-feature BP schema, we implement a skill-file-driven pipeline in which each feature is externally defined through schema files, decision rules, and examples. Two human annotators completed a two-round schema-only protocol on a 300-instance validation subset, enabling BP skills to be classified as directly operable, recoverable under focused re-annotation, or structurally underspecified. GPT-5.4 and three locally deployable open-source models were then evaluated under the same setup. Results show that BP annotation is highly heterogeneous at the skill level: 5 skills are directly operable, 4 are recoverable after focused re-annotation, and 5 remain structurally underspecified. GPT-5.4 executes the retained skills with substantial reliability (accuracy = 0.678, κ = 0.665, weighted F1 = 0.695), but this feasibility is selective rather than global. Human and GPT difficulty profiles are strongly aligned at the skill level (r = 0.881), but not at the instance level (r = 0.016) or lexical-item level (r = -0.142), a pattern we describe as "shared taxonomy, independent execution". Pairwise agreement further suggests that GPT is better understood as an independent third skill voice than as a direct human substitute. Open-source failures are concentrated in schema-to-skill execution problems. These findings suggest that automatic annotation should be evaluated in terms of skill feasibility rather than task-level automation.
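For readers unfamiliar with the metrics named in the abstract (accuracy, Cohen's κ, weighted F1, and Pearson's r), the following is a minimal sketch of how each can be computed for a single annotation skill. It is not the authors' evaluation code: the function names, the toy labels, and the toy per-skill error rates are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def accuracy(gold, pred):
    """Fraction of items where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the two annotators' marginal label frequencies.
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def weighted_f1(gold, pred):
    """Per-label F1, averaged with weights proportional to gold support."""
    total = 0.0
    for label, n_gold in Counter(gold).items():
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        n_pred = sum(p == label for p in pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_gold
    return total / len(gold)


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical labels for one skill: human reference vs. model output.
gold = ["A", "A", "B", "B", "C", "C"]
pred = ["A", "B", "B", "B", "C", "A"]

# Hypothetical per-skill error rates used as difficulty profiles.
human_difficulty = [0.10, 0.20, 0.40, 0.50]
model_difficulty = [0.15, 0.25, 0.35, 0.60]

print("accuracy   :", round(accuracy(gold, pred), 3))
print("kappa      :", round(cohen_kappa(gold, pred), 3))
print("weighted F1:", round(weighted_f1(gold, pred), 3))
print("skill-level r:", round(pearson_r(human_difficulty, model_difficulty), 3))
```

Computing the correlation separately over skills, instances, and lexical items is what allows difficulty profiles to agree at one level of aggregation and diverge at another, which is the contrast the abstract reports.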
