Fugu-MT Paper Translation (Abstract): The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Paper abstract: The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

arxiv url: http://arxiv.org/abs/2510.17388v1
Date: Mon, 20 Oct 2025 10:26:26 GMT
Status: Translation complete
System update date: 2025-10-25 03:08:12.037054
Title: The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
Title (reference translation): The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives
Authors: Henry Lim, Kwan Hui Lim
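Abstract summary: Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning. The authors evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks.
Reference score (independently computed attention score): 7.085868567930685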
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical under four paradigms, namely: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.
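Abstract (reference translation): Instruction-tuned large language models (IT-LLMs) show strong zero-shot reasoning, but their ability to carry out simple, self-contained instructions remains unexplored, even though it underlies complex instruction-following. Twenty IT-LLMs were evaluated on modified MMLU and MMLU-Pro benchmarks by systematically changing the format of the option labels (alphabetic, numeric, Roman) while keeping their meaning identical, under four paradigms. (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When the option contents are removed, models fall below random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results reveal that current instruction-tuning paradigms are insufficient and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.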
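The label-format manipulation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`make_labels`, `format_question`, `random_choice_baseline`), the prompt layout, and the wording of the instruction line are all assumptions made for illustration. The abstract states only that option labels were varied (alphabetic, numeric, Roman) with and without explicit instructions, and that results were compared against a random-choice baseline.

```python
from typing import List

# Roman numerals for up to ten options (an assumed upper bound for this sketch).
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]


def make_labels(style: str, n: int) -> List[str]:
    """Return n option labels in the requested style (hypothetical helper)."""
    if not 1 <= n <= len(ROMAN):
        raise ValueError("n must be between 1 and 10")
    if style == "alphabetic":
        return [chr(ord("A") + i) for i in range(n)]
    if style == "numeric":
        return [str(i + 1) for i in range(n)]
    if style == "roman":
        return ROMAN[:n]
    raise ValueError(f"unknown label style: {style}")


def format_question(question: str, options: List[str], style: str,
                    with_instruction: bool = True) -> str:
    """Render one multiple-choice item; only the labels differ across styles."""
    labels = make_labels(style, len(options))
    lines = [question]
    lines += [f"{label}. {option}" for label, option in zip(labels, options)]
    if with_instruction:
        # Assumed instruction wording; the paper's actual prompt is not given here.
        lines.append("Answer with only the label of the correct option ("
                     + ", ".join(labels) + ").")
    return "\n".join(lines)


def random_choice_baseline(n_options: int) -> float:
    """Expected accuracy of uniform random guessing over n_options labels."""
    return 1.0 / n_options


if __name__ == "__main__":
    options = ["Paris", "Rome", "Madrid", "Berlin"]
    for style in ("alphabetic", "numeric", "roman"):
        print(format_question("What is the capital of France?", options, style))
        print()
    print("Random-choice baseline (4 options):", random_choice_baseline(4))
```

Because the option contents are identical across the three renderings, any accuracy difference between them would reflect sensitivity to the label format alone, which is the kind of effect the abstract reports.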
