Fugu-MT Paper Translation (Summary): Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities

Paper summary: Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities

arxiv url: http://arxiv.org/abs/2601.18554v1
Date: Mon, 26 Jan 2026 15:02:15 GMT
Status: Translation complete
Last updated in system: 2026-01-27 15:23:08.882921
Title: Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities
Title (reference translation): Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities
Authors: Alberto Purpura, Li Wang, Sahil Badyal, Eugenio Beaufrand, Adam Faulkner,
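Abstract summary: Existing benchmarks fail to reflect real-world use or to isolate compliance from task success. The paper introduces MOSAIC, a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints. Compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position.
Reference score (independently computed attention score): 2.9203730377983654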
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reliably ensuring Large Language Models (LLMs) follow complex instructions is a critical challenge, as existing benchmarks often fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints to enable a granular and independent analysis of this capability. Our evaluation of five LLMs from different families based on this new benchmark demonstrates that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights are critical for diagnosing model failures and developing more reliable LLMs for systems that demand strict adherence to complex instructions.
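Abstract (reference translation): Reliably ensuring that large language models (LLMs) follow complex instructions is a critical challenge. MOSAIC (MOdular Synthetic Assessment of Instruction Compliance) is a modular framework that uses a dynamically generated dataset with up to 20 application-oriented constraints to enable a granular and independent analysis of this capability. An evaluation of five LLMs from different families on this benchmark shows that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights are important for diagnosing model failures and for developing more reliable LLMs for systems that demand strict adherence to complex instructions.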
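The abstract reports that compliance varies with constraint type, quantity, and position. The following is a minimal sketch of how per-constraint results could be grouped along those three axes. It is not the MOSAIC implementation: the `ConstraintResult` record, the `compliance_by` helper, the constraint labels, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintResult:
    """One constraint check on one model response (hypothetical record layout)."""
    constraint_type: str  # illustrative labels such as "length" or "format"
    position: int         # 0-based index of the constraint within the prompt
    n_constraints: int    # total number of constraints in that prompt
    satisfied: bool       # outcome of the compliance check


def compliance_by(results, key):
    """Return the fraction of satisfied constraints for each value of key(result)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [satisfied, total]
    for r in results:
        bucket = counts[key(r)]
        bucket[0] += int(r.satisfied)
        bucket[1] += 1
    return {group: sat / total for group, (sat, total) in counts.items()}


# Toy records invented for illustration; these are NOT results from the paper.
results = [
    ConstraintResult("length", 0, 3, True),
    ConstraintResult("format", 1, 3, False),
    ConstraintResult("keyword", 2, 3, True),
    ConstraintResult("length", 0, 2, True),
    ConstraintResult("format", 1, 2, True),
]

by_type = compliance_by(results, lambda r: r.constraint_type)
by_position = compliance_by(results, lambda r: r.position)
by_quantity = compliance_by(results, lambda r: r.n_constraints)
```

Scoring each constraint separately, rather than scoring a response as a single pass or fail, is what makes it possible to look for effects such as primacy or recency: the same records can be regrouped by any attribute without re-running the model.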

List of related papers